IaC Architecture at 5, 50, and 500 Engineers

Tue, Jun 14, 2022
8-minute read

At several dozen engineers, a SaaS company I worked with, one that had been growing its platform team faster than its tooling, had a single Terraform state file that took nearly twenty minutes to plan. Engineers from three teams kept colliding on the apply lock. The platform team was visibly burning out from rebasing each other’s changes. The story everyone in the org told was “Terraform doesn’t scale.” What didn’t scale was their architecture.

The pattern I see consistently: the pain teams attribute to their tooling is mostly architecture pain. Module boundaries grew without a plan. State files became too valuable to touch. The blast radius nobody mapped is the one that surfaces during the first cross-account incident, and it’s never the radius the runbook assumed.

IaC architecture is downstream of team size. The codebase that worked at five engineers won’t work at fifty, and the one that worked at fifty won’t survive your next hiring wave. The shapes aren’t interchangeable. Most of the pain you’ll encounter is the pain of running a smaller team’s architecture at a larger team’s scale.

flowchart LR Five[5 engineers
one repo, one state] --> Fifty[50 engineers
state split by lifecycle] Fifty --> FiveHundred[500 engineers
IaC as a product
versioned modules] style Five fill:#eaf2fa style Fifty fill:#fff5e0 style FiveHundred fill:#eaf2fa

Figure 1. The three IaC codebase shapes. Each one is correct at its scale and broken at the others. The hardest transitions happen when the team grows past 50 and again past 200, because the pattern that worked yesterday is the pattern that breaks tomorrow.

At 5 engineers: one repo, one state

One repository, one state file, every engineer can apply, and the review process is whoever’s on Slack. It works. Don’t apologize for it.

A five-person platform team I advised had a few thousand lines of Terraform in one repo, a single AWS account, a couple of Slack-mediated approvals a week. Nobody had grepped for an unfamiliar resource in months. Their architecture was correct.

Three things to get right at this stage. Pin your provider versions first; the day you don’t, you’ll find out why. Lock state remotely from day one: the cost is a few minutes, and the cost of recovering from concurrent applies takes much longer. Keep modules thin and obvious. Compose, don’t abstract.

The trap here isn’t sophistication. It’s the opposite. Engineers who’ve worked in larger codebases will reach for patterns that pay off at a scale you don’t have yet, and you’ll end up with a 50-engineer codebase that five people are using. Generic modules with twelve input variables. Abstraction layers that hide the AWS resource you need to debug. A wrapper repo with another wrapper repo. Each of those decisions was justified somewhere. The cumulative effect is a codebase nobody can read until it grows into the shape it was prematurely designed for.

IaC architecture at 50 engineers: split state by lifecycle

The single state file becomes the constraint. Plans take tens of minutes. Engineers from different teams collide on the same lock. A single bad apply disrupts unrelated services.

This is where you split state. The honest split is by lifecycle, not by team or service. Networks change rarely. Pipelines change daily. Put them in the same state file and every pipeline change carries network blast radius for no reason.

flowchart TB subgraph Slow["Slow-changing"] Net[Networking
VPCs, peering, DNS] Sec[Security baseline
IAM roots, KMS] end subgraph Medium["Medium cadence"] Data[Data infrastructure
databases, queues] Compute[Compute platforms
clusters, ASGs] end subgraph Fast["Fast-changing"] CI[CI/CD pipelines] App[Application configs] end style Net fill:#eaf2fa style Sec fill:#eaf2fa style Data fill:#fff5e0 style Compute fill:#fff5e0 style CI fill:#fdd style App fill:#fdd

Figure 2. State split by lifecycle. The slow-changing layer is amended quarterly; the fast-changing layer is amended daily. Putting them in the same state file means every CI change is reviewed against the same blast radius as a VPC change, and the review process either becomes unusable or becomes ceremonial.

That SaaS company I opened with: after splitting their state into network, data, compute, and CI lifecycles, average plan time dropped from nearly twenty minutes to under two. The lock contention disappeared the same week. The platform team’s burnout went down before any new tooling did.

The other change at this scale is the review process. Slack-and-vibes worked at five engineers. At fifty, you need plan output as a reviewable artifact: the diff in a PR comment, a real human eyeballing it before apply. Auto-apply on merge at this scale is a story you’ll tell at the post-incident review.

Here is what the lifecycle split looks like in practice. The pattern that holds is to express the dependency direction explicitly, not leave it to convention:

# environments/prod/networking/main.tf
# Slow-changing layer — reviewed before any other state changes.
# Other layers depend on outputs from this one; never the reverse.

module "vpc" {
  source  = "../../modules/vpc"
  version = "2.4.0"

  cidr_block         = var.vpc_cidr
  availability_zones = var.azs
  name               = "prod-core"
}

output "vpc_id" {
  value       = module.vpc.vpc_id
  description = "Consumed by compute and data layers via remote state."
}

# environments/prod/compute/main.tf
# Medium-cadence layer — reads networking outputs via remote state.
# Never imports from the fast-changing layer.

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state-prod"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "eks_cluster" {
  source  = "../../modules/eks"
  version = "1.9.0"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnets
  name       = "prod-workloads"
}

IaC architecture at 500 engineers: IaC as a product

The constraint is no longer state file size or review throughput. It’s coordination. Three teams want to change the same VPC for different reasons in the same week. The shared module that 100 services depend on can’t ship a breaking change. The new hire can’t safely apply anything for their first six months because the blast radius of any apply is unclear from the code.

The shift is to treat IaC as a product. The platform team owns shared modules. The rest of engineering consumes them with version numbers, changelogs, deprecation notices, and a roadmap. Internal customers pin versions, don’t read the source unless something breaks, and expect upstream to behave like a maintained dependency. The same etiquette you’d use with a third-party vendor library, applied internally.

At a large SaaS company, the platform team shipped shared modules with semver, organized dozens of separate state files by domain and lifecycle, and put policy-as-code at the apply boundary. A meaningful share of plans got blocked before they reached a reviewer. I’ll admit the line between “genuine security improvement” and “a metric leadership wanted to see” was blurry here; the policy program benefited from both. But the discipline held, and the number gave the program room to continue.

Teams that try to scale IaC past 500 engineers without this discipline end up with the platform team as the bottleneck for every change in the org. That’s the moment IaC starts getting blamed for problems that are coordination problems.

How the three shapes compare

Each shape is a coherent system. The problems arise when a team inherits one shape but operates at a different scale. The comparison below surfaces the inflection points:

Dimension	5 engineers	50 engineers	500 engineers
State structure	Single file, single account	Split by lifecycle (4–6 files)	Domain-and-lifecycle matrix (20+ files)
Module strategy	Thin wrappers, no abstraction	Shared modules, no versioning yet	Versioned modules with changelogs
Review process	Slack approval	PR with plan output	Policy-as-code at apply boundary
Apply ownership	Any engineer	Designated applier per layer	Platform team owns shared; teams own app
Primary failure mode	Premature abstraction	Monolithic state, lock contention	Coordination bottleneck, blast-radius opacity

The table shows why the 50-engineer transition is the hardest. The 5-engineer shape stops working visibly, but it stops working all at once: plan times, lock collisions, and incident blast radius compound together and the cause is unmistakable. The 500-engineer shape, by contrast, often doesn’t break cleanly. Coordination problems arrive as friction before they arrive as outages.

Every IaC codebase carries exceptions

Bootstrap code. The legacy account that pre-dates the codebase. The glue between IaC and the configuration management you haven’t replaced yet. Every IaC codebase has places where the rules don’t apply.

The failure mode isn’t having exceptions. It’s pretending you don’t. An exception module named like a regular module, hidden in a directory called misc, with no comment explaining why it exists, breaks the codebase’s legibility for the next person who has to read it. New hires assume the patterns they see are the patterns they should follow. They’ll copy the exception code into a new context, and the exception will multiply.

Name your exceptions. Put them in a directory called legacy or bootstrap. Write a README that says what they’re for and what would have to be true to retire them. The honest exception, marked as such, costs much less than the unmarked one that quietly becomes the pattern.

Reading your IaC architecture like a job description

That SaaS company got their plan time back under two minutes. The lock contention disappeared. Within a quarter, several new product teams onboarded without the platform team becoming a bottleneck. None of that was a Terraform improvement. It was them admitting they’d outgrown their architecture and building the next one.

The decisions that worked when you were five people don’t fail visibly at fifty. They just make every change a little slower, every review a little less honest, every apply a little more anxious. By the time the cost is legible, fixing it costs much more than making the right structural move at the right moment would have.

Read your codebase the way you’d read a job description: this is what we built for, this is what we are now. If those two sentences don’t match, the gap is your work. The harder version of that question is whether the people in your organization who could sponsor the transition can see the gap clearly enough to fund it.

platform infrastructure platform architecture