CI/CD Pipeline Architecture for Infrastructure

Thu, Dec 8, 2022

A team at a SaaS company that had done application CI/CD well for years extended the same discipline to infrastructure: pull request, review, merge, auto-apply. The pattern worked for application code, so they assumed it would work for infrastructure. The first production incident came at 3am: an IAM change revoked the engineering team’s own access to the production account. The reviewer hadn’t noticed the policy diff hidden inside a several-hundred-line plan. Recovery took most of a working day, because the recovery path required the access that had just been revoked.

Application CI/CD is well-trodden. The patterns are mature, and most teams have a working answer to “how does code get to production.” Infrastructure CI/CD is younger, the patterns are still being negotiated, and the failure modes look unlike anything from the application world. Applying application patterns directly produces predictable failures: auto-apply on merge with no human review, rollback strategies that don’t roll back, blast radii that surprise everyone the first time something goes wrong.

Why infrastructure CI/CD is different

Application code and infrastructure code share a syntax, often the same VCS, often the same review tooling, and almost nothing else.

The deepest difference is state. State changes don’t roll back the way a binary swap does. You can re-apply an old plan, but the resources you destroyed are still destroyed. The new ones have different IDs. Anything that referenced the old IDs is now broken. “Roll back to the previous version” assumes a version concept that doesn’t apply to a database that’s now empty, which means infrastructure rollback sits closer to disaster recovery than to deployment rollback. Teams that haven’t internalized this end up writing rollback documentation that reads correctly and doesn’t work in practice.

Blast radius is the second difference, and the more expensive one when miscalculated. An IAM change affects every workload in the account. A networking change can take down services that didn’t deploy. The mental model from application deployment, where blast radius equals the service being deployed, is the wrong frame. Change cadence makes this harder to gate consistently: networking might ship quarterly, IAM ships monthly, application configs ship daily. A pipeline that treats a daily config change and a quarterly VPC change identically either becomes ceremonial for the daily changes or too permissive for the quarterly ones.

The pipeline shape that works

flowchart LR
    PR[PR opened] --> Plan["Plan stage<br/>posts plan output<br/>as PR comment"]
    Plan --> Review["Human review<br/>plan output is the artifact,<br/>not the diff"]
    Review --> Approve{Approved?}
    Approve -->|Yes| Gate["Manual gate<br/>for high-blast-radius<br/>changes"]
    Approve -->|No| PR
    Gate --> Apply[Apply stage]
    Apply --> Verify[Post-apply verification]
    style Plan fill:#eaf2fa
    style Gate fill:#fff5e0

Figure 1. A safe infrastructure pipeline. The plan stage produces a reviewable artifact. Human review reads the plan, not the diff. The manual gate exists for high-blast-radius changes. Post-apply verification catches the changes that succeeded technically and broke something operationally.

The plan output is the reviewable artifact, not the source diff. A two-line code change can produce a 200-resource plan. The reviewer needs to see the plan, not just the change.
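To make the plan readable as an artifact, the pipeline can condense it before posting it to the PR. The following is a minimal sketch, assuming the plan is available as Terraform-style plan JSON (the `resource_changes` shape that `terraform show -json` emits); the function name `summarize_plan` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_plan(plan: dict) -> str:
    """Return a short reviewer-facing summary of a plan JSON document."""
    counts = Counter()
    lines = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing for a reviewer to approve
        # A replacement is encoded as a two-element delete/create pair.
        kind = "replace" if len(actions) == 2 else actions[0]
        counts[kind] += 1
        lines.append(f"{kind:<8} {rc['address']}")
    header = ", ".join(f"{n} to {kind}" for kind, n in sorted(counts.items()))
    return "\n".join([header or "no changes"] + sorted(lines))


# Hypothetical plan fragment for illustration only.
example_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
        {"address": "aws_iam_role.app", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
}

if __name__ == "__main__":
    print(summarize_plan(example_plan))
```

The summary is a reading aid, not a substitute for the full plan: the header line tells the reviewer how much there is to read, and the sorted list groups destructive actions together.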

Not every change type warrants the same gate. The blast radius varies too much:

Change type	Blast radius	Appropriate gate
IAM policy	Account-wide	Manual review + apply approval
Networking / VPC	Region-wide	Manual review + apply approval
Security group	Service + dependencies	Plan review required
App config / env var	Single service	Automated approval
Tagging / metadata	Cosmetic	Auto-apply after plan

A mid-size SaaS I worked with built a pipeline that posted plan output as a PR comment, required a reviewer who wasn’t the author, and added a manual gate only for changes touching IAM or networking. Production incidents from infrastructure changes dropped from a few per year to roughly one. The gating wasn’t heavier overall. It was placed where the blast radius warranted.
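One way to place gates by blast radius is to derive the required gate from the resource types in the plan. The sketch below is illustrative only: it assumes Terraform-style plan JSON and AWS resource type names, and the prefix-to-gate mapping in `RULES` is a hypothetical encoding of the table above, not a recommended rule set.

```python
from enum import IntEnum


class Gate(IntEnum):
    AUTO_APPLY = 0          # tagging / metadata
    AUTOMATED_APPROVAL = 1  # app config / env var
    PLAN_REVIEW = 2         # security groups, and anything unrecognized
    MANUAL_APPROVAL = 3     # IAM, networking


# Assumed mapping from resource-type prefix to gate; first match wins.
RULES = [
    ("aws_iam_", Gate.MANUAL_APPROVAL),
    ("aws_security_group", Gate.PLAN_REVIEW),
    ("aws_vpc", Gate.MANUAL_APPROVAL),
    ("aws_subnet", Gate.MANUAL_APPROVAL),
    ("aws_route", Gate.MANUAL_APPROVAL),
    ("aws_ssm_parameter", Gate.AUTOMATED_APPROVAL),
]


def only_tags_changed(change: dict) -> bool:
    """True when an in-place update differs only in its tag attributes."""
    if change["actions"] != ["update"]:
        return False
    before = change.get("before") or {}
    after = change.get("after") or {}
    changed = {k for k in before.keys() | after.keys()
               if before.get(k) != after.get(k)}
    return bool(changed) and changed <= {"tags", "tags_all"}


def gate_for(rc: dict) -> Gate:
    """Gate required by a single resource change."""
    if only_tags_changed(rc["change"]):
        return Gate.AUTO_APPLY
    for prefix, gate in RULES:
        if rc["type"].startswith(prefix):
            return gate
    return Gate.PLAN_REVIEW  # unknown types get a human look by default


def required_gate(plan: dict) -> Gate:
    """The strictest gate demanded by any changed resource in the plan."""
    changed = [rc for rc in plan.get("resource_changes", [])
               if rc["change"]["actions"] not in (["no-op"], ["read"])]
    return max((gate_for(rc) for rc in changed), default=Gate.AUTO_APPLY)
```

The plan as a whole takes the strictest gate any single resource demands, so one IAM change in an otherwise routine config rollout still triggers manual approval. Unrecognized resource types fall back to plan review rather than automation.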

Environment promotion happens through state, not through merge alone. Merging to main isn’t the deploy; it’s the artifact. The deploy is a separate, environment-aware step with its own state. Conflating the two is how a dev pipeline becomes a prod pipeline by adding a target, and the prod pipeline inherits a review process designed for dev.

The pipeline shapes that fail

Auto-apply to production on merge. The failure isn’t “if” but “when.” You’ll eventually merge a change that looks fine and isn’t. The apply will happen before anyone reads the plan. The company I opened with had this exact pipeline. The single 3am incident cost more than a year’s worth of manual approval steps.

The single-environment pipeline bolted onto multiple environments. The dev pipeline became the prod pipeline by adding a target. The dev pipeline didn’t have IAM in scope. The prod pipeline does, and your dev-pipeline review process is now reviewing prod IAM changes with reviewers who haven’t been briefed on what they’re looking at.

The pipeline that conflates application and infrastructure changes. A single PR that changes both a service binary and the IAM role attached to it. Different review needs, different rollback paths, different blast radii. Reviewing the union as one artifact means the application reviewer doesn’t read the IAM diff and the IAM reviewer doesn’t read the application diff. Both ship together.
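A cheap guard against this shape is a pre-review check that flags PRs touching both kinds of files. This is a rough sketch under stated assumptions: the directory names in `INFRA_DIRS` and the suffixes in `INFRA_SUFFIXES` are hypothetical and would need to match the repository’s real layout, and everything not classified as infrastructure is treated as application code.

```python
from pathlib import PurePosixPath

# Hypothetical conventions; adjust to the repository's actual layout.
INFRA_SUFFIXES = {".tf", ".tfvars"}
INFRA_DIRS = {"infra", "terraform"}


def is_infra(path: str) -> bool:
    """Classify a changed file as infrastructure by suffix or directory."""
    p = PurePosixPath(path)
    return p.suffix in INFRA_SUFFIXES or bool(INFRA_DIRS & set(p.parts))


def split_changes(paths):
    """Partition changed paths into (infrastructure, application) lists."""
    infra = sorted(p for p in paths if is_infra(p))
    app = sorted(p for p in paths if not is_infra(p))
    return infra, app


def is_mixed(paths) -> bool:
    """True when a change set touches both infrastructure and application files."""
    infra, app = split_changes(paths)
    return bool(infra) and bool(app)
```

A pipeline could fail the check outright or simply request both reviewer groups; either way the split becomes explicit instead of depending on each reviewer reading the other half.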

The pipeline that depends on engineer-only access to debug. If the pipeline’s logs live in a system only engineers can reach, the on-call can’t see what the pipeline is doing, and the first 3am incident is when everyone finds out. From then on, the on-call is paged for problems they can’t investigate, and recovery time is however long it takes to wake up an engineer.

The review surface

What’s being reviewed matters more than how it’s conducted.

Policy-as-code at the apply boundary catches the cases humans miss. Mechanical rules free reviewers to look at intent:

# OPA/Rego: deny any IAM policy statement granting all actions on all resources
deny[msg] {
  # Bind one statement so both wildcard checks apply to the same statement,
  # not to two unrelated statements elsewhere in the plan.
  stmt := input.resource_changes[_].change.after.statement[_]
  stmt.actions[_] == "*"
  stmt.resources[_] == "*"
  msg := "IAM policy grants *:* - requires explicit sign-off"
}

The pipeline that runs policy-as-code at the apply boundary, not just pre-merge, also catches the click-ops change that bypassed review entirely.

Plan diffs that are too large to read are an architecture problem, not a review problem. A 2,000-resource plan is a sign your modules aren’t isolated. Adding reviewers fixes the symptom; the cause is upstream in how the state files are split.
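A pipeline can surface this signal mechanically by counting changes per module and refusing to call an oversized plan reviewable. The sketch below assumes Terraform-style plan JSON, where module resources carry a `module_address` field; the threshold `MAX_REVIEWABLE_CHANGES` is a made-up number to be tuned per team.

```python
from collections import Counter

MAX_REVIEWABLE_CHANGES = 50  # hypothetical threshold; tune per team


def changes_per_module(plan: dict) -> Counter:
    """Count changed resources per module address."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        if rc["change"]["actions"] in (["no-op"], ["read"]):
            continue
        # Resources in the root module are grouped under "(root)".
        counts[rc.get("module_address") or "(root)"] += 1
    return counts


def too_large_to_review(plan: dict, limit: int = MAX_REVIEWABLE_CHANGES) -> bool:
    """True when the plan changes more resources than a reviewer can read."""
    return sum(changes_per_module(plan).values()) > limit
```

The per-module breakdown is the more useful output: a plan that spreads changes across many unrelated modules hints at where the state could be split.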

The reviewer’s actual job is spotting the unintended consequence: the change that looks fine and isn’t. The unintended consequence usually lives in the plan output, not the source diff. A reviewer who reads the source carefully but skims the plan will miss it. This is where the IAM blast radius problem shows up in pipelines: an IAM change buried in a 400-line plan is the easiest unintended consequence to miss, and the most expensive when it ships.

The drift problem

The pipeline isn’t the only thing that changes infrastructure. Humans change it through the console. Other automation touches it. Other pipelines in adjacent systems touch it. The state file diverges from reality.

Detection, reconciliation, and alerting are three different responses. Detection alone surfaces drift but doesn’t fix it. Reconciliation fixes it but can be unsafe: re-applying state can destroy resources that were created out-of-band for a deliberate reason. Alerting is the safer middle ground: surface drift to humans, let them decide.

Continuous reconciliation that destroys out-of-band resources is the same failure pattern as auto-apply. It removes the human at exactly the moment the human’s judgment is the actual control.
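The alerting middle ground can be sketched as a function that turns detected drift into messages for humans and deliberately stops there. This assumes Terraform-style plan JSON that reports out-of-band changes under a `resource_drift` key; the function name and message format are illustrative, and delivery to a chat channel or pager is left out.

```python
def drift_alerts(plan: dict) -> list[str]:
    """Turn drift entries in a plan JSON document into alert messages.

    Nothing is reconciled here: the function only reports, so a human
    decides whether the out-of-band change should be kept or reverted.
    """
    alerts = []
    for rd in plan.get("resource_drift", []):
        actions = "/".join(rd["change"]["actions"])
        alerts.append(
            f"drift detected: {rd['address']} ({actions} outside the pipeline)"
        )
    return alerts


# Hypothetical drift entry for illustration only.
example_plan = {
    "resource_drift": [
        {"address": "aws_security_group.api", "change": {"actions": ["update"]}},
    ]
}

if __name__ == "__main__":
    for line in drift_alerts(example_plan):
        print(line)
```

Running this on a schedule gives detection and alerting without granting the scheduler permission to apply anything.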

What the pipeline is for

An infrastructure pipeline is the control plane for your organization’s ability to change things safely. Design it carelessly and you’ll find out which assumptions you made during the first production change that goes wrong, usually at 3am, usually with leadership watching.

The pipeline that holds up treats plan output as a reviewable artifact, environment promotion as an architectural decision, and rollback as a feature tested before the first incident. Gates belong where the blast radius warrants them. The company I opened with rebuilt their pipeline over the following quarter: plan-as-artifact, manual gate for IAM and networking, post-apply verification. Infrastructure incidents dropped to near zero in the following year. The pipeline they ended up with isn’t more sophisticated. It’s more honest about what infrastructure changes are.
