IAM Sprawl Is Access Debt

Tue, Nov 15, 2022
8-minute read

Three years into your cloud journey, you run an IAM inventory. The spreadsheet has hundreds of roles. Dozens have administrative permissions. A handful are attached to service accounts whose original purpose nobody can explain. A few belong to users who haven’t logged in for over a year. The rest belong to projects that may or may not still exist; the only way to know is to grep through Terraform you aren’t sure you own.

IAM is the part of your cloud footprint that grows fastest and gets cleaned up the slowest. Every project adds roles. Every integration adds policies. Every migration leaves a trail of permissions someone meant to revoke and never did. The result looks like a security control on the surface and increasingly behaves like a liability underneath.

How sprawl accumulates

The accumulation pattern is consistent: each event adds permissions, and almost nothing removes them. Four shapes account for most of it.

The most reliable of the four is the P0 elevation that never gets revoked: someone needs broad access to fix a production issue, the access gets granted as temporary, and the temporary becomes permanent because nobody scheduled the revocation. By the third incident in a quarter, the same engineer has accumulated three elevated paths, none cleaned up. A close variant is the human-turned-service-account: a real person’s IAM principal becomes a service account because nobody had time to do the proper service-identity work, and the principal keeps the human’s permissions long after the human has left. A mid-size SaaS I advised, one that kept tight code review discipline but had never applied it to IAM, had a service account named after a former engineer who’d been gone for years, with admin access to the production data store. Nobody had revoked it because nobody knew what depended on it. The cleanup surfaced several jobs that had been silently authenticating as the departed engineer.

The other two shapes are slower. The “just for now” policy is granted broader scope than necessary to unblock a project, the project ships, and the exception ages into the new normal: the team that inherits it a year later assumes the broad scope is the intended scope, because nothing in the code or the comments suggests otherwise. Acquisitions add another vector: you inherit an IAM model from a smaller company, the migration gets scheduled for “after the integration,” and two years in the integration is still ongoing. The acquired roles still exist, and none of them is owned by anyone still at the company.

flowchart LR
  Day1[Day 1<br/>50 roles<br/>clean model]
  Year1[Year 1<br/>hundreds<br/>incidents grant<br/>some]
  Year2[Year 2<br/>several hundred<br/>service accounts<br/>proliferate]
  Year3[Year 3<br/>several hundred<br/>acquisition adds<br/>more]
  Year4[Year 4<br/>too many<br/>to audit cleanly]
  Day1 --> Year1 --> Year2 --> Year3 --> Year4
  style Year3 fill:#fff5e0
  style Year4 fill:#fdd

Figure 1. The growth pattern of IAM in a typical cloud environment. The role count rarely declines. Audit pressure shows up around year four, by which point the cleanup costs far more than preventing the sprawl would have cost along the way.

This is one of the cleanest examples of governance debt you’ll find in cloud infrastructure. Each individual decision was reasonable. The cumulative position is one no architect would have approved if asked.

The shape of the debt

Once accumulated, IAM debt has recognizable patterns:

Pattern	What it looks like	Blast radius	Why cleanup stalls
Role with unknown purpose	Name suggests one thing, policy another, usage a third	Varies	Nobody wants to be the one who breaks something
Overbroad policy	`*:*` or cross-account access that outlived its use case	Account-wide	Narrowing requires usage data nobody has
Phantom trust relationship	Cross-account trust for a use case that ended	Cross-account	Invisible until a migration or audit surfaces it
“Everyone is admin somewhere”	No single account has sprawl; union of accounts does	Full org	Blast radius only visible when you compute it

The unifying property is that IAM debt is invisible until you measure it. The roles exist. The cleanup doesn’t appear on any team’s roadmap because no team owns IAM as a system.
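The “everyone is admin somewhere” row is the one that only shows up when you compute it. A minimal sketch of that computation, assuming a hypothetical flat inventory of `(principal, account_id, is_admin)` rows exported from each account (the names and account IDs are made up for illustration):

```python
from collections import Counter

# Hypothetical inventory rows: (principal, account_id, is_admin).
# Real data would come from an export of each account's role bindings.
INVENTORY = [
    ("alice", "111111111111", True),
    ("alice", "222222222222", False),
    ("bob", "222222222222", True),
    ("ci-deployer", "333333333333", True),
    ("carol", "111111111111", False),
]


def admins_per_account(rows):
    """Count admin principals in each account (the per-account view)."""
    return Counter(account for _, account, is_admin in rows if is_admin)


def admins_anywhere(rows):
    """Principals that hold admin in at least one account (the org-wide view)."""
    return {principal for principal, _, is_admin in rows if is_admin}


if __name__ == "__main__":
    everyone = {principal for principal, _, _ in INVENTORY}
    print("admins per account:", dict(admins_per_account(INVENTORY)))
    print(f"admin somewhere: {len(admins_anywhere(INVENTORY))} of {len(everyone)} principals")
```

Each account in this toy data has a single admin, which looks fine in isolation. The union is three of four principals, and that union is the number that describes the blast radius.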

Why cleanup is hard

Removing access is high-risk. The day after revocation is when something breaks, usually not the thing you expected. You revoke a service account; an integration nobody knew about used the same role; a critical job fails silently for 36 hours. The first time this happens, the cleanup project gets paused. The second time, it gets canceled.

The data needed to clean up often doesn’t exist: usage data, ownership, original intent. Roles created by a contractor who left a year ago have no owner, and proving they’re unused requires access logs from a period when logging wasn’t enabled. You can’t audit what you didn’t log.

There’s also a political cost to asking teams to justify their access. The conversation reads as accusatory even when it isn’t. The team that doesn’t respond fast enough has its access revoked, which causes an incident, which gives the rest of the org a reason to push back on the cleanup. The project ends because the political capital ran out, not because the work finished. I’ve seen this stop cleanup efforts more often than any technical obstacle has, and I’ll admit it took me a few engagements to start treating it as the primary risk from the beginning.

Patterns that prevent sprawl

The pattern that holds up is shifting from cleanup to non-accumulation.

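Just-in-time access for elevated permissions. Engineers request elevation for a window measured in hours, not weeks. The access expires by default. Nothing has to be revoked because nothing was granted permanently. AWS IAM supports time-limited conditions natively through its date condition operators. In the statement below, ${expiry} is a placeholder that the granting tool fills with an ISO 8601 timestamp; it is not a built-in IAM policy variable: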

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::*:role/prod-admin",
      "Condition": {
        "DateLessThan": { "aws:CurrentTime": "${expiry}" },
        "StringEquals": { "aws:RequestedRegion": "us-east-1" }
      }
    }
  ]
}
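How the `${expiry}` value gets filled in depends on your tooling. As a rough sketch, assuming a small Python helper sits between the approval workflow and the policy attachment (the function name `jit_statement` and its parameters are hypothetical, not part of any AWS SDK), the rendering step could look like this:

```python
import json
from datetime import datetime, timedelta, timezone


def jit_statement(hours, now=None,
                  role_arn="arn:aws:iam::*:role/prod-admin",
                  region="us-east-1"):
    """Build the assume-role statement with a concrete expiry timestamp.

    `now` should be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    expiry = (now + timedelta(hours=hours)).astimezone(timezone.utc)
    return {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": role_arn,
        "Condition": {
            # Date conditions take an ISO 8601 timestamp, rendered here in UTC.
            "DateLessThan": {"aws:CurrentTime": expiry.strftime("%Y-%m-%dT%H:%M:%SZ")},
            "StringEquals": {"aws:RequestedRegion": region},
        },
    }


if __name__ == "__main__":
    granted_at = datetime(2022, 11, 15, 12, 0, tzinfo=timezone.utc)
    print(json.dumps(jit_statement(hours=4, now=granted_at), indent=2))
```

One caveat worth planning for: the condition gates new `sts:AssumeRole` calls. A session started shortly before the expiry stays valid for its own session duration, so the role's maximum session duration should be short as well.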

A SaaS company I worked with, trying to get a SOC 2 renewal without triggering an engineering revolt, rolled out JIT elevation for production access over more than a year and eliminated the bulk of standing admin permissions. Engineering pushback was significant for the first quarter, then fell to near zero as the workflow matured.

Workload identity replacing service accounts. Workloads authenticate as workloads, using short-lived credentials issued by the platform instead of long-lived keys. The long tail of “service accounts that aren’t really service accounts” disappears. The credential rotation problem disappears with it, because there’s no long-lived credential to rotate.

Periodic access review with consequences, not just attestation. Quarterly reviews where teams defend access, not click “still needed” on a 200-row spreadsheet. The reviews that work have explicit consequences for unjustified access. The reviews that don’t are attestation theater whose main output is a screenshot for the auditor.

IAM as code, with review. Click-ops accumulated over years is the substrate of sprawl. Code review at the apply boundary catches most of it before it becomes sprawl. This is also where IAM intersects with the structure of your IaC codebase: when IAM lives in a state file shared with networking and compute, reviewers stop reading carefully because the changes are too large to review.
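Part of that review can be automated so humans only read what matters. A minimal sketch of a pre-merge check, assuming policies are available as parsed JSON documents; the function name `overbroad_statements` and the two rules it applies are illustrative choices, not a complete linter:

```python
def _as_list(value):
    """IAM allows a string or a list for Action/Resource/Statement; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def overbroad_statements(policy):
    """Return {statement index: [reasons]} for Allow statements with wildcard scope."""
    findings = {}
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        reasons = []
        for action in _as_list(stmt.get("Action")):
            # Flag "*" and service-wide grants such as "iam:*".
            if action == "*" or action.endswith(":*"):
                reasons.append(f"Action {action}")
        if "*" in _as_list(stmt.get("Resource")):
            reasons.append("Resource *")
        if reasons:
            findings[index] = reasons
    return findings


# Hypothetical policy used only to exercise the check.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::reports/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Deny", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["iam:*"], "Resource": "arn:aws:iam::*:role/app-*"},
    ],
}

if __name__ == "__main__":
    for index, reasons in overbroad_statements(EXAMPLE_POLICY).items():
        print(f"statement {index}: {', '.join(reasons)}")
```

A check like this will not judge intent. Its job is to force a wildcard grant to arrive with a justification in the review instead of slipping through inside a large diff.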

The honest cleanup approach

If you’re already in the sprawl, sequencing matters. Doing it in the wrong order surfaces incidents and stalls the project.

Start with the highest-blast-radius access, not the easiest wins. The roles with cross-account or admin scope are the ones that matter for the next breach. Cleanup of “small” roles is satisfying. It doesn’t change your posture.
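One way to make that ordering explicit is to score each role before anyone opens a ticket. The sketch below is illustrative only: the `Role` fields and the weights (admin scope counts double, cross-account trust counts once) are assumptions, and a real scoring would use whatever your inventory actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    is_admin: bool
    cross_account: bool


def blast_radius_score(role):
    """Hypothetical weighting: admin scope = 2, cross-account trust = 1."""
    return 2 * role.is_admin + 1 * role.cross_account


def cleanup_order(roles):
    """Highest blast radius first; ties broken by name for a stable worklist."""
    return sorted(roles, key=lambda r: (-blast_radius_score(r), r.name))


# Made-up roles for illustration.
ROLES = [
    Role("log-reader", is_admin=False, cross_account=False),
    Role("prod-admin", is_admin=True, cross_account=False),
    Role("partner-sync", is_admin=False, cross_account=True),
    Role("org-break-glass", is_admin=True, cross_account=True),
]

if __name__ == "__main__":
    for role in cleanup_order(ROLES):
        print(blast_radius_score(role), role.name)
```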

Base decisions on access logs, not assumptions. If a role hasn’t been used in 90 days, it’s a strong revocation candidate. Without logs, every revocation is a guess, and some percentage of those guesses will break something.
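The 90-day rule is easy to apply once last-used timestamps are in hand. A minimal sketch, assuming a hypothetical mapping of role name to last-used time (on AWS, the IAM `GetRole` response carries this as `RoleLastUsed`; how you collect it is up to your tooling). Roles with no recorded use are reported separately, because missing data is not the same as proof of disuse:

```python
from datetime import datetime, timedelta, timezone


def revocation_candidates(last_used, now, max_idle_days=90):
    """Split roles into (stale, unknown).

    last_used maps role name -> timezone-aware datetime, or None when
    no usage has been recorded for the role.
    """
    cutoff = now - timedelta(days=max_idle_days)
    stale = sorted(name for name, ts in last_used.items() if ts is not None and ts < cutoff)
    unknown = sorted(name for name, ts in last_used.items() if ts is None)
    return stale, unknown


# Made-up data for illustration.
LAST_USED = {
    "etl-nightly": datetime(2022, 11, 14, tzinfo=timezone.utc),
    "legacy-report": datetime(2022, 6, 1, tzinfo=timezone.utc),
    "contractor-tmp": None,
}

if __name__ == "__main__":
    today = datetime(2022, 11, 15, tzinfo=timezone.utc)
    stale, unknown = revocation_candidates(LAST_USED, now=today)
    print("stale (revocation candidates):", stale)
    print("no usage data (investigate first):", unknown)
```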

Accept that some cleanup will surface broken integrations, and have a runbook ready. The first round of revocations will break things you didn’t know existed. That’s diagnostic data, not a reason to stop. The team that stops at the first incident will still be doing this cleanup in three years.

Make cleanup a continuous practice, not a project with a beginning and end. Sprawl is what accumulates between cleanups. The goal is to reduce the time between them until the cleanup is small enough to be routine, the way a healthy team treats certificate rotations.

The team that asked me about IAM sprawl didn’t ask because they wanted to clean up the inventory. They asked because their auditor had. Half a year later, they’d retired well over half their roles, named owners for the rest, and built a quarterly review that took days instead of weeks. None of that was glamorous. It changed the answer to “what’s the blast radius if a credential leaks?” from “we don’t know” to a number small enough to defend.

If you can’t answer “who can access this data?” without an audit project, the blast radius of a leaked credential is wider than you think. Treat IAM as a system with its own lifecycle and ownership, not as a side effect of every other system’s deployment.
