The IDP That Helps, and the One That Just Adds Overhead

Tue, Dec 12, 2023
10-minute read

The dashboard counted CLI invocations as portal usage, so the platform’s adoption number kept going up while the portal itself kept being avoided. Nobody was lying. The platform team wasn’t measuring the wrong thing on purpose. The thing they were measuring just wasn’t the thing that would have told them whether the platform was doing work.

That’s the pattern in the wild. A platform exists, the metrics look healthy, the leadership readouts are clean, and the engineers it was built for quietly route around it. The difference between an IDP that helps and one that just creates overhead is rarely the technology underneath. It’s the discipline of building for actual users, talking to them after launch, and being willing to delete the features that aren’t earning their keep.

flowchart TB subgraph Useful["The useful version"] UA[Small number of paths
deeply supported] UB[Real beta users
before broad launch] UC[Honest measurement
of time saved] end subgraph Overhead["The overhead version"] OA[Many features
shallow support] OB[Built for an
imagined developer] OC[Adoption metrics
that count activity] end Useful -.->|"same tooling,
different
discipline"| Overhead style Useful fill:#eaf2fa style Overhead fill:#fff5e0

Figure 1. Two platforms with the same tools and very different outcomes. The fork happens early, in the shape of “who is this for and how do we know if it’s working.” Most teams that end up on the right of the diagram started with the same intent as the teams on the left and lost the discipline somewhere between RFC and v1 launch.

What a useful IDP feels like, from the developer side

The signal isn’t a portal full of features. It’s that the time from idea to running code dropped, the context-switching to remember which tool does what dropped, and the defaults are sane enough that most cases don’t require thinking. When a developer needs something the defaults don’t support, the escape hatch is honest: you can drop down to the underlying primitives without the platform team blocking you, fighting you, or marking you as off-path in a way that costs you politically.

The quieter signal is what developers say to each other unprompted. In the teams where the IDP is doing real work, you hear “I wish my last company had this.” In the teams where it isn’t, you hear nothing, or you hear “the portal” used as a verb that means “the thing I have to do before I can do my actual work.”

Useful IDPs treat their developers as the customer. The team building the platform watches them use it, asks them what’s slow, and fixes those things first. That sounds obvious. It’s the part most platform teams don’t do, because it’s harder than building features and the feedback loops require admitting which features were wrong.

What an unhelpful IDP feels like, from the developer side

The unhelpful version stacks an extra layer on top of the systems your engineers still have to know. They learn the platform’s abstractions and they still have to learn Kubernetes, Terraform, and the cloud underneath because the abstractions leak whenever something breaks. The “self-service” workflow has a button that opens a ticket. The “golden path” is narrower than what most teams do, so the people on it are doing toy work and the people doing real work are off it.

The deeper failure is conceptual. The platform was built for an imagined developer who fits the assumptions baked into the abstractions. When real developers turn out to want different shapes of system, the platform team treats them as wrong, not as feedback. The “we’re going to standardize on the golden path” decree is usually a sign that the path is bad and the team in charge of it doesn’t want to know.

A mid-size SaaS, one that had been scaling rapidly and had grown from three platform-consumer teams to fourteen, shipped over a dozen platform features in a year. An internal survey found only a handful were actively used, several had been used once and abandoned, and the rest had never been adopted by anyone outside the platform team. The team’s leadership readout reported the feature count. The engineering organization had absorbed the cost of integrating with all of them because the platform team had political authority to demand it. That’s the textbook overhead version. Nobody had done anything malicious. The discipline of asking “is this earning its place” had simply not been applied.

What produces useful platforms

The patterns that hold up are well-known and rarely followed. A small number of opinionated paths, deeply supported, beats a large number of shallow ones, because the team can debug the paths it ships when something breaks. Real beta users need to be involved before broad launch, picked from teams that don’t owe the platform team favors and won’t soften their feedback to be polite. And the platform team needs product discipline: PMs or PM-shaped people, designers or designer-shaped people, feedback loops, and willingness to ship a feature only to teams who asked for it.

The hardest discipline is honest measurement. I’ll admit I haven’t seen many teams sustain this well: it requires admitting, repeatedly, that some of what you shipped is costing more than it’s saving. The platform team I respect most measures time saved per developer per week, broken down by team, and it doesn’t include time the platform itself consumed. They don’t count CLI invocations. They don’t count portal logins. They count “what did the developer do this week, and how much of that was the platform helping versus the platform being a tax.” When a feature shows up as a tax, they kill it.

That mechanism is the one most platform teams don’t have because it requires acknowledging that some of what they shipped was wrong. It’s also the mechanism that distinguishes the platforms that compound from the ones that calcify.

What produces overhead

Four patterns show up repeatedly. Building for an imagined developer — one who matches the platform team’s prior experience, not the actual engineering org’s daily reality. Abstracting capabilities that developers were already comfortable using directly, on the assumption that abstraction is automatically good. Compliance and governance requirements disguised as platform features, shipped to engineers who experience them as friction, not capability. And ownership without product staffing: the platform was built by infrastructure operations, expected to behave like a product, given neither the budget nor the staffing to do that.

The compliance-as-platform pattern is the hardest to fix. The platform team gets pulled in to enforce a security or compliance requirement. The path of least political resistance is to add it to the platform: “we’ll have the platform require this.” Developers experience the requirement as platform overhead. The security team is no longer in the room. Everyone is unhappy and nobody is sure how to undo the arrangement.

Anti-pattern	Root cause	Developer experience	How it persists
Imagined developer	Platform team’s own priors	“This doesn’t fit our workflow”	No user research loop
Abstraction without value	Assumption that wrapping is improvement	Extra layer with same complexity	Hard to measure what was removed
Compliance as feature	Political pressure to enforce requirements	Friction disguised as capability	Security team leaves the room
Mandate without accountability	Authority without product staffing	“The thing I must do before real work”	No mechanism to prove value

The honest signals

The numbers worth tracking are all the ones that resist gaming. Adoption rate segmented by team, because uniform adoption with one outlier team is different from concentrated adoption with one team carrying the platform. Time-to-onboard for a new service, measured by stopwatch, not by self-report. Developer satisfaction, asked directly with structured surveys that the platform team doesn’t write, not inferred from feature uptake. Off-path usage rate, because high off-path usage means the golden path doesn’t fit and low off-path usage means it probably does.

The metric the platform team will instinctively want to report is the one that isn’t on this list: feature count. Feature count tells you nothing about whether the platform is doing work. A platform team that ships fewer features and runs more direct conversations with their users will produce more value than one that ships features at a steady cadence regardless of demand.

The discipline of deletion

The most underrated platform skill is killing features. Not deprecating, not graduating, not leaving them up because somebody might still be using them. Killing them. A platform that can’t kill features will, over a few years, accumulate a long tail of half-supported surface area that the team has to maintain forever, that new engineers have to learn around, and that obscures the paths that work.

A SaaS platform team I worked with ran a feature audit each quarter. They listed every platform capability, the teams currently using it, the cost to maintain, and the value being produced. Anything that ranked low on use and high on cost went on a six-week deprecation track. The team who’d shipped the original feature ran the deprecation. The discipline meant the platform got smaller in directions it should have. The directions that mattered got room to grow.

The platform team that can’t run that audit honestly is the one whose IDP becomes overhead at scale. Not because the engineers are bad, but because no system stays useful if nothing ever leaves it.

Here is the catalog entry format that team used to make each feature’s standing legible:

# catalog-info.yaml
# Backstage-compatible component descriptor.
# Every platform capability gets one of these.
# Quarterly audit: anything scoring red on value/cost goes on deprecation track.

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: service-provisioning-workflow
  description: >
    Self-service workflow for spinning up a new backend service
    with opinionated defaults (Kubernetes, Postgres, internal auth).
  annotations:
    # Audit metadata — updated each quarter
    platform.company.com/last-audit: "2023-10-01"
    platform.company.com/audit-owner: "Platform Engineering"
    platform.company.com/value-rating: "green"    # green / yellow / red
    platform.company.com/cost-rating: "green"
    platform.company.com/active-consumer-teams: "7"
    platform.company.com/last-used: "2023-11-28"
    platform.company.com/time-saved-hrs-per-week: "4.2"
    platform.company.com/deprecation-status: "none"
spec:
  type: platform-capability
  lifecycle: production
  owner: platform-engineering
  system: internal-developer-platform
  dependsOn:
    - component:kubernetes-cluster
    - component:internal-auth-service
  providesApis:
    - service-provisioning-api

---
# Example of a capability on deprecation track
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: legacy-env-manager
  description: >
    Environment variable management UI. Superseded by
    external-secrets-operator integration in Q2.
  annotations:
    platform.company.com/last-audit: "2023-10-01"
    platform.company.com/audit-owner: "Platform Engineering"
    platform.company.com/value-rating: "red"
    platform.company.com/cost-rating: "yellow"
    platform.company.com/active-consumer-teams: "1"
    platform.company.com/last-used: "2023-09-12"
    platform.company.com/time-saved-hrs-per-week: "0.1"
    platform.company.com/deprecation-status: "in-progress"
    platform.company.com/deprecation-due: "2024-01-15"
    platform.company.com/deprecation-owner: "team-infrastructure"
spec:
  type: platform-capability
  lifecycle: deprecated
  owner: platform-engineering
  system: internal-developer-platform

flowchart TD Launch([Feature shipped]) --> M1{Active users
after 60 days?} M1 -->|No| D1[Add to
deprecation watch] M1 -->|Yes| M2{Time saved
positive?} D1 --> D2{Still no users
at 90 days?} D2 -->|Yes| Kill[Deprecation
track: 6 weeks] D2 -->|No| M2 M2 -->|No: net tax| Fix{Fixable in
one sprint?} M2 -->|Yes| Keep[Keep and
invest further] Fix -->|Yes| Patch[Ship fix,
re-measure] Fix -->|No| Kill style Kill fill:#fdd style D1 fill:#fff5e0 style Keep fill:#eaf2fa style Patch fill:#fff5e0

Figure 2. The feature lifecycle decision that separates platforms that compound from ones that calcify. Most teams skip the measurement step entirely, which means features that are net taxes never reach the deprecation track.

Where the value comes from

The platform engineer at the SaaS I opened with eventually ran her own audit, half a year after launch. The portal had shipped with over a dozen features and roughly that many service templates. Only a few were saving developer time. A couple had emerged from listening to the engineers already using the CLI. The rest went into a “what is this protecting” review, and most of them shrunk. The portal got smaller. The dashboard stopped counting CLI invocations as portal usage.

The question I keep sitting with after these engagements isn’t whether the platform team had the technical skill. They almost always did. It’s whether the organization gave them permission to say “this feature isn’t working” before it became a political fact on the ground.

platform platform