The Three Stages Every IDP Goes Through

Tue, Oct 15, 2024

Most platform teams are not platform teams. They’re engineering teams that have been told to build a platform without being told to run one. The two jobs share a name, but they’re separated by a set of disciplines no one wrote into the team’s charter.

The pattern is so consistent it has stages, and the stages aren’t about technology. They’re about what the team has to learn next. Almost every team I advise is somewhere on this arc: a first platform that nobody used, a second that stalled at 70% adoption, or a third where adoption is wide and the team is burning out because nobody named the discipline shift that the platform’s success demanded.

flowchart LR
    Stage1["Stage 1<br/>no users<br/>built for imagined<br/>developer"]
    Stage2["Stage 2<br/>users but wrong<br/>abstractions<br/>adoption stalls 70%"]
    Stage3["Stage 3<br/>must be run<br/>as a product<br/>SLAs, versioning,<br/>roadmap"]
    Stage1 --> Stage2 --> Stage3
    style Stage1 fill:#fdd
    style Stage2 fill:#fff5e0
    style Stage3 fill:#eaf2fa

Figure 1. The three stages of an IDP. Each stage requires a discipline the previous one didn’t, and the transitions are where most platforms stall. The teams that survive stage three usually owe it to leadership recognizing they’re running a product, not building infrastructure.

The first platform: the one with no users

Almost every IDP starts here. A platform team is funded, often after a leadership conversation about developer productivity. They build a self-service portal. They ship golden paths for the workflows they think developers want. They demo at the all-hands. Six months later, adoption is concentrated in two friendly teams, and the rest of engineering is using whatever they were using before.

The mistake at this stage isn’t technical. The technology usually works. The mistake is that the platform was designed for a developer who doesn’t exist. The “golden path” was the path the platform team thought engineers should want, not the one they walked.

A mid-size SaaS company I worked with, one that was trying to standardize deployments across a fast-growing engineering org, spent half a year building a service-deployment portal. The post-mortem was clear once we ran it. Their portal asked engineers to learn three new abstractions to deploy a service they were already deploying with one. The platform was technically elegant and operationally a tax. Nobody opts in to a tax. Adoption sat in the low double digits after six months, and most of that was the platform team’s own services.

The way out is unglamorous. Pick three teams. Sit with their engineers. Watch them ship. Take notes on the parts they curse at, not the parts they ask for. Engineers ask for things they already know how to want. The interesting requirements live in the friction they’ve stopped noticing because they’ve worked around it for two years.

Your platform’s first version should solve a problem you watched somebody have, not a problem you assumed they had. The teams that get this stage right tend to look like they’re moving slowly, because they spend the first few months in other people’s codebases. The teams that look like they’re moving fast at this stage are usually building the platform that won’t get used.

The second platform: the one with users but the wrong abstractions

You learn from the first failure. You start with users. You ship golden paths your beta users want. Adoption climbs to 60%, then 70%. And then it stalls.

What’s happening is that the abstractions are leaking. Engineers are learning the platform and the underlying system, because every non-trivial use case requires understanding what the platform is hiding. The golden path is good for 70% of cases. The other 30% require dropping out of the platform entirely, and the drop-out experience is painful enough that senior engineers route around the platform on principle. The grudging usage is the canary. When the most experienced engineers tell you they “use it because they have to,” you’ve stopped winning their adoption and started taxing their patience.

A mid-size SaaS company I advised hit exactly this. The IDP had strong adoption for greenfield services, where engineers had no muscle memory to fight against. Teams operating legacy systems disliked it, because the platform’s all-or-nothing model meant their existing deployment patterns had no path forward. We added escape hatches: components that teams could consume à la carte instead of as a bundled platform. Adoption climbed in those teams not because we’d improved the golden path, but because we’d made the off-path experience livable.

The work at this stage is mostly diagnostic. Audit which abstractions earn their keep. Some hide complexity that would otherwise tax every engineer in the org. Others hide complexity that engineers have already learned to handle and resent re-learning through your interface. Build honest escape hatches. The off-path experience matters more than the golden-path experience for senior engineers, because senior engineers are the ones who decide whether the platform is credible.
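One way to picture an honest escape hatch is a golden path assembled from steps that a team can replace one at a time, rather than a bundle they take or leave. The sketch below is illustrative only: the step names (`build_image`, `sign_image`, `rollout`), the `deploy` function, and `legacy_rollout` are hypothetical, not any real platform’s API.

```python
from typing import Callable, Dict, Optional

# A step takes the deployment context and returns an updated copy.
Step = Callable[[dict], dict]


def build_image(ctx: dict) -> dict:
    return {**ctx, "image": f"{ctx['service']}:latest"}


def sign_image(ctx: dict) -> dict:
    return {**ctx, "signed": True}


def rollout(ctx: dict) -> dict:
    return {**ctx, "deployed_via": "platform-rollout"}


# Hypothetical golden path: ordered, but each step is individually replaceable.
GOLDEN_PATH: Dict[str, Step] = {
    "build": build_image,
    "sign": sign_image,
    "rollout": rollout,
}


def deploy(service: str, overrides: Optional[Dict[str, Step]] = None) -> dict:
    """Run the golden path, letting a team swap out individual steps."""
    overrides = overrides or {}
    unknown = set(overrides) - set(GOLDEN_PATH)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    ctx = {"service": service}
    # Iterate in golden-path order so overrides cannot reorder the pipeline.
    for name, default_step in GOLDEN_PATH.items():
        ctx = overrides.get(name, default_step)(ctx)
    return ctx


def legacy_rollout(ctx: dict) -> dict:
    # Stand-in for a team's existing deployment script.
    return {**ctx, "deployed_via": "legacy-script"}
```

A legacy team could call `deploy("billing", overrides={"rollout": legacy_rollout})` and still get the platform’s build and signing steps. The design choice is that leaving the path for one step does not mean leaving the platform.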

Measure what gets adopted, what gets abandoned, and what gets used grudgingly. The first two are visible. The third requires asking and listening. It’s also the most useful signal you’ll get.
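As a rough sketch of how those three signals might be combined, the function below classifies a team from two usage counts plus one survey answer. Everything in it is an assumption made for illustration: the 90-day windows, the `would_choose_again` survey question, and the label names. The point it shows is that usage data alone cannot produce the “grudging” label; that one only appears when somebody has been asked.

```python
from typing import Optional


def classify_team(
    deploys_last_90d: int,
    deploys_prior_90d: int,
    would_choose_again: Optional[bool],
) -> str:
    """Classify a team's relationship with the platform.

    would_choose_again is a hypothetical survey answer; None means
    nobody has asked the team yet.
    """
    if deploys_last_90d == 0:
        # Visible from telemetry: they used it before and stopped, or never started.
        return "abandoned" if deploys_prior_90d > 0 else "never adopted"
    if would_choose_again is False:
        # Only visible by asking: active usage, but not by choice.
        return "grudging"
    if would_choose_again is None:
        return "adopted (not yet asked)"
    return "adopted"
```

For example, `classify_team(20, 15, False)` returns `"grudging"` even though the deploy counts look healthy.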

The discipline this stage builds is the same discipline that distinguishes an IDP that helps from one that just creates overhead. The abstractions either earn their keep or they accrue interest in a currency the platform team rarely sees: senior engineers’ patience.

The third platform: the one that has to be operated as a product

You’ve fixed the abstractions. You have wide adoption. The platform now serves 200+ engineers across 15+ teams. And the platform team’s job has changed in ways nobody saw coming.

You’re running a product, with users, SLAs, support obligations, and a roadmap that has to balance feature work against bug fixes against migration work. The platform’s failure modes are now production failure modes for everyone who depends on it. A breaking change in a core abstraction is a coordination problem with 15 teams.

The platform teams I’ve seen reach this stage arrived staffed with “engineers, mostly.” A year into operating at product scale, the gaps that showed up weren’t technical. The missing pieces were product managers, deprecation windows, and a platform SLO, and adding them was an organizational admission, not an architecture decision.

What this stage requires breaks into four areas.

A real product manager, not a tech lead acting as one part-time. The skills for running a product with internal customers (roadmapping, deprecation comms, balancing competing requests) are not the skills you hired for when you funded an infrastructure team.

Versioning, deprecation cycles, and migration support. “We’re moving everyone to the new abstraction next quarter” is not a strategy; it’s a pre-incident.

An SLO for the platform itself, with consequences when you breach it. The platform is now the thing that breaks production for everyone else when it breaks, and it deserves to be treated that way.

A funding model that survives reorgs and budget cycles. Funding a platform project-to-project means the next executive transition rebuilds it from scratch, with the cost borne by every team that depends on it.
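Of the four areas, the SLO is the easiest to make concrete, because “consequences when you breach it” reduces to arithmetic. The sketch below assumes an availability target measured over a rolling 30-day window and assumes the consequence is a feature freeze; both are illustrative choices, and the function names are hypothetical.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    # Rounded to hide floating-point noise in the subtraction.
    return round(window_minutes * (100.0 - availability_pct) / 100.0, 2)


def should_freeze_feature_work(
    availability_pct: float,
    downtime_minutes: float,
    window_days: int = 30,
) -> bool:
    """True once observed downtime exceeds the error budget for the window."""
    return downtime_minutes > error_budget_minutes(availability_pct, window_days)
```

With a 99.9% target, `error_budget_minutes(99.9)` comes to 43.2 minutes per 30 days. A single 50-minute platform outage would exhaust it, which is the moment the roadmap conversation has to change.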

The transition into this stage is harder than it looks because product discipline is not a tooling decision. It’s a hiring decision, a funding decision, and a reporting-line decision. Most platform teams arrive at stage three with a tech-lead culture and try to operate a product with that culture. The misalignment shows up in the work, not in the architecture diagrams.

What each stage demands

The table below compares the three stages across the dimensions that separate success from failure. The hard admission: most platform teams discover they’re in stage three by experiencing the burnout, not by anticipating it.

Dimension	Stage 1	Stage 2	Stage 3
User discovery	Assumed; built for imagined developer	Beta users involved; real feedback	Continuous; NPS-style signals and office hours
Abstractions	Opinionated bundles	Escape hatches added after stall	Formally versioned; deprecation windows
Failure mode	Low adoption; teams use workarounds	Senior engineers route around; grudging usage	Platform outage = production outage for all
Funding model	Project budget	Project budget, often extended	Persistent product budget; survives reorgs
Team shape	Engineers only	Engineers + occasional UX research	Engineers + PM + designer + SRE
SLO	None	Informally tracked	Formal SLO with error budget and breach reviews

A platform catalog entry that reflects stage three

A stage-three platform publishes its services in a software catalog so dependent teams can discover, assess, and track dependencies without asking the platform team. The Backstage-style entry below shows the operational metadata that separates a catalog entry from mere documentation: ownership, lifecycle, SLO tier, and the versioning contract.

# catalog/components/deploy-service.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: deploy-service
  title: Deploy Service
  description: >
    Golden-path deployment orchestrator for all production services.
    Wraps Kubernetes rollout, image signing, and Argo CD sync.
  annotations:
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/service-id: "P3XKQYZ"
    github.com/project-slug: platform-eng/deploy-service
  tags:
    - platform
    - deployments
    - golden-path
  links:
    - url: https://wiki.example.com/platform/deploy-service
      title: Runbook
    - url: https://status.example.com/deploy-service
      title: SLO Dashboard
spec:
  type: service
  lifecycle: production
  owner: group:platform-engineering
  system: internal-developer-platform

  # Versioning contract: consumers pin to a major version.
  # Minor versions are backward-compatible.
  # Major version changes require 90-day migration window.
  providesApis:
    - deploy-service-api-v2

  dependsOn:
    - resource:argocd-production
    - resource:image-registry
    - resource:secrets-manager

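  # NOTE: `slo` and `deprecation` are custom extension fields, not part of the
  # core Backstage Component schema. A stock catalog may need a schema
  # extension (or annotations instead) to carry them.
  slo:
    tier: critical          # platform outage = production impact
    availability: "99.9%"
    latency_p99: "2s"
    error_budget_policy: "freeze-feature-work-on-breach"

  deprecation:
    v1_sunset_date: "2025-03-01"
    migration_guide: https://wiki.example.com/platform/deploy-service/v1-to-v2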
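
The versioning contract in the entry above is only useful if something checks it. As a minimal sketch, the functions below test whether a sunset date leaves the 90-day migration window the entry promises. The function names and the announcement dates in the usage note are hypothetical; only the 90-day window and the 2025-03-01 sunset date come from the entry.

```python
from datetime import date, timedelta

# From the catalog entry: major version changes require a 90-day migration window.
MIGRATION_WINDOW = timedelta(days=90)


def latest_announcement(sunset: date) -> date:
    """Last day a deprecation can be announced and still honor the window."""
    return sunset - MIGRATION_WINDOW


def window_is_honored(announced: date, sunset: date) -> bool:
    """True if consumers get at least the full migration window."""
    return sunset - announced >= MIGRATION_WINDOW
```

For the v1 sunset date of 2025-03-01, `latest_announcement(date(2025, 3, 1))` is 2024-12-01. An announcement made on 2024-12-02 would already break the contract by one day.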

The cross-cutting concerns that distinguish a platform from a tooling collection

Five concerns don’t belong to any single stage. They make the difference between a platform and a tooling collection, and they’re often where the GitOps and IaC patterns you’ve already built either pay off or fail to.

Identity: the platform is the surface where developer workspace identity meets the production identity model. If those two are unconnected, the platform is just a UI in front of someone else’s authorization decisions.

Secrets: the platform handles the lifecycle (rotation, audit, scoped access), not just the storage. A platform that hands an engineer a secrets vault and walks away has solved 10% of the actual problem.

Observability: the platform provides the instrumentation that the services it deploys will need. If teams have to assemble observability separately for every service, the platform isn’t reducing the cognitive load it claimed to reduce.

Cost: the platform makes cost visible at the level of decision-making, not at the level of the invoice. Telling a team they spent tens of thousands of dollars last month is not actionable; showing them the per-deploy cost at the moment they’re choosing instance sizes is.

Compliance: the platform encodes the policies, so individual teams don’t relearn them every quarter. A compliance posture that depends on every team being trained quarterly is a compliance posture that drifts.
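
To make “the platform encodes the policies” concrete, here is a small sketch of a check that could run against a catalog entry before it is accepted. The rules themselves are assumptions chosen for illustration (production components need an owner, a PagerDuty annotation, and a declared SLO), and `policy_violations` is a hypothetical helper, not part of Backstage.

```python
from typing import List

# Assumed policy: annotations every production component must carry.
REQUIRED_PROD_ANNOTATIONS = ("pagerduty.com/service-id",)


def policy_violations(entity: dict) -> List[str]:
    """Return human-readable policy violations for a catalog entity dict."""
    violations: List[str] = []
    spec = entity.get("spec", {})
    annotations = entity.get("metadata", {}).get("annotations", {})

    if not spec.get("owner"):
        violations.append("spec.owner is required")

    # Stricter rules apply only once a component is marked as production.
    if spec.get("lifecycle") == "production":
        for key in REQUIRED_PROD_ANNOTATIONS:
            if key not in annotations:
                violations.append(f"production component missing annotation {key}")
        if "slo" not in spec:
            violations.append("production component must declare spec.slo")

    return violations
```

An empty list means the entry passes. Because the rules live in one function that the platform runs, a policy change is a code change rather than a training session.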

A platform without these is a CLI with branding. A platform with them is the surface that makes the rest of the engineering org’s work tractable.

Reading the platform’s job description

flowchart TD
    Charter["Platform team<br/>charter"] --> Q1{"Job description<br/>says 'build<br/>infrastructure'?"}
    Q1 -->|Yes| Q2{"Team is<br/>doing product<br/>management?"}
    Q1 -->|No| Aligned["Charter and<br/>work aligned"]
    Q2 -->|Yes| Gap["Gap = next<br/>two years of<br/>pain"]
    Q2 -->|No| Safe["Charter and<br/>work match"]
    Gap --> Fix["Rewrite charter<br/>hire PM<br/>reframe funding"]
    style Aligned fill:#eaf2fa
    style Safe fill:#eaf2fa
    style Gap fill:#fdd
    style Fix fill:#fff5e0

Figure 2. The gap between what the team’s charter says and what the work requires is rarely surfaced until burnout or attrition forces the conversation. The platform teams that survive stage three are the ones where leadership read the job description against the actual work before a crisis made it unavoidable.

The platform teams that make it through eventually hire a real product manager, redraw their funding model, and stop calling themselves an infrastructure team. Adoption usually stays where it was. Burnout goes down. The next executive transition doesn’t trigger a rebuild, because the platform is now visible to leadership as a product the org depends on.

The IDP that succeeds isn’t the one with the best technology. It’s the one whose team learned, through enough painful iteration, to operate the thing they built. The teams that don’t learn run a fourth platform, and a fifth, and the org wonders why platform engineering “doesn’t work here.”

Read your platform team’s job description against what they’re doing. If the description says “build infrastructure” and the work is product management, that gap is not a framing problem. It’s a funding problem, a hiring problem, and a reporting-line problem, and the team is carrying all three of them right now.
