The Error Budget Mindset

Tue, Apr 22, 2025
6-minute read

“I don’t think it’s ever caused us to slow shipping. We’ve lowered the SLO several times when we couldn’t hit it.” That’s how a platform director at a large SaaS company answered when I asked when a budget breach had last changed what her team did. The dashboards were tidy. The alerts were tuned. On paper, it was a textbook implementation.

That’s the failure mode, and it’s not the math. Error budget management only works in organizations willing to slow down when the budget is exhausted. The mechanism gets built. The alerts fire. And when the budget runs out, the org keeps shipping anyway, because shipping is what the org rewards. If the consequence has never landed, you don’t have an error budget program. You have an availability dashboard with an extra calculation on it.

What the error budget is

The error budget is the complement of the SLO: if your SLO is 99.9%, your budget is the remaining 0.1%, the share of the window (or of requests) that is allowed to fail. It’s computed from the SLO and service behavior, not assigned as a target to spend down. Treating it as something to allocate against feature work converts it into a financial metaphor and obscures what it’s actually for. It’s also a shared resource: the platform team’s risky migration consumes the same budget as the product team’s risky deploy, and unless teams agree explicitly on how it is shared, it gets consumed by whoever moves first.
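To make the arithmetic concrete, here is a minimal sketch for a time-based SLO. The function names and the 30-day window are illustrative assumptions, not part of any particular tool. For a 99.9% target over 30 days, the budget works out to 43.2 minutes of downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime over the window for a time-based SLO.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime_so_far: timedelta) -> float:
    """Fraction of the budget still unspent (negative once it is exhausted)."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / budget


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed SLO window
    print("budget:", error_budget(0.999, window))
    print("remaining:", budget_remaining(0.999, window, timedelta(minutes=30)))
```

A request-based SLO follows the same shape, with failed and total request counts in place of durations.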

The piece most programs skip: the budget has consequences when it runs out, or it isn’t a budget. That consequence has to be agreed on in advance, in writing, by people senior enough to enforce it under pressure. Improvising after the breach doesn’t work. The breach is exactly when the org’s bias toward shipping is strongest.

The patterns that survive contact with reality

Most teams start with raw availability alerts and end up muting them. A single-window alert can’t tell a brief spike from sustained degradation, so it pages constantly and eventually gets ignored. The two-window approach fixes that: a short window catches the severity, a longer one confirms the rate is real before waking anyone up:

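# Pages when the last hour burns budget at more than 14.4x the sustainable rate
# and the last 6 hours confirm a sustained burn above 6x.
# At 6x, a 30-day budget is gone in ~5 days; at 14.4x, in ~2 days.
# Assumes slo_error_budget holds the allowed error ratio (e.g. 0.001 for 99.9%).
- alert: ErrorBudgetBurning
  expr: |
    job:slo_error_rate:1h > 14.4 * slo_error_budget
    and
    job:slo_error_rate:6h > 6 * slo_error_budget
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Burn rate elevated: 30-day budget exhausted in ~5 days or less at this pace"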

The discipline is keeping alerts pruned over time. An alert that pages every week gets muted.
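The multipliers in a burn-rate alert are easier to reason about once you see where they come from. The sketch below is an assumption-laden illustration, not the alerting system's own logic: burn rate is the observed error rate divided by the budget fraction, and a full budget lasts the window length divided by the burn rate. The helper names and the 30-day default are hypothetical.

```python
def burn_rate(error_rate: float, budget_fraction: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    return error_rate / budget_fraction


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def should_page(err_1h: float, err_6h: float, budget_fraction: float,
                short_threshold: float = 14.4,
                long_threshold: float = 6.0) -> bool:
    """Two-window check: the short window shows severity, the long one persistence."""
    return (burn_rate(err_1h, budget_fraction) > short_threshold
            and burn_rate(err_6h, budget_fraction) > long_threshold)


if __name__ == "__main__":
    budget = 0.001  # 99.9% SLO
    print(days_to_exhaustion(6.0))           # sustained 6x burn
    print(should_page(0.02, 0.007, budget))  # severe and sustained
    print(should_page(0.02, 0.003, budget))  # brief spike only
```

The second call pages because both windows are over their thresholds; the third does not, because the six-hour window shows the spike has not persisted.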

The quarterly review is the meeting where you find out whether the program is real. A mid-size SaaS company I advised, one that had done the work to earn a mature SLO program, held that review every quarter without exception. When one product team exhausted its budget by mid-quarter, it accepted a month-long feature freeze and used the time on a long-deferred reliability project. The freeze held because finance had committed, in writing, to evaluating that team on reliability metrics that quarter, not just feature throughput. Without that commitment, the freeze would have been negotiated away in the second week.

The release freeze is the hardest consequence to keep credible. Use it too often and it loses its teeth. Never use it and it’s a fiction nobody takes seriously. The middle ground (once or twice a year, with leadership backing) is worth aiming for.

How budget programs decay

In the first year, everything looks like progress. SLOs get set, alerts fire when they should, dashboards are clean. Then a breach happens, the consequence doesn’t land, and the program starts cutting corners. Four shapes show up repeatedly, often in sequence.

The most common shape: the budget gets measured and never enforced. Dashboards keep updating, but the consequence conversation never happens because the leadership that would have backed it has moved on. Second, the SLO gets silently lowered, which is even harder to catch: 99.9% becomes 99.5% the next quarter, then 99% the quarter after, with no explicit conversation about user impact. Third, ownership drifts: engineering manages the metric, product ignores it, finance never sees it. Finally, the burn-rate alert fires constantly and gets muted one notification at a time until the system is silent. The program is dead. The dashboard hasn’t noticed.
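The quiet SLO change is the easiest of these to check mechanically. Over a 30-day window, 99.9% allows 43.2 minutes of downtime, 99.5% allows 216 minutes, and 99% allows 432 minutes (7.2 hours), so two "small" adjustments multiply the budget tenfold. A rough sketch of a check that flags any loosened target with no recorded sign-off follows; the SloRevision record and its fields are hypothetical, and a real program would pull this history from wherever SLO definitions are versioned.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SloRevision:
    quarter: str
    target: float                        # fraction, e.g. 0.999
    signed_off_by: Optional[str] = None  # who approved the change, if anyone


def allowed_minutes(target: float, window_days: int = 30) -> float:
    """Downtime budget in minutes for a time-based SLO."""
    return (1.0 - target) * window_days * 24 * 60


def unapproved_loosenings(
    history: List[SloRevision],
) -> List[Tuple[SloRevision, SloRevision]]:
    """Pairs of consecutive revisions where the target dropped without sign-off."""
    flagged = []
    for prev, curr in zip(history, history[1:]):
        if curr.target < prev.target and not curr.signed_off_by:
            flagged.append((prev, curr))
    return flagged


if __name__ == "__main__":
    history = [
        SloRevision("Q1", 0.999, signed_off_by="vp-eng"),
        SloRevision("Q2", 0.995),
        SloRevision("Q3", 0.99),
    ]
    for prev, curr in unapproved_loosenings(history):
        growth = allowed_minutes(curr.target) / allowed_minutes(prev.target)
        print(f"{prev.quarter} -> {curr.quarter}: budget grew {growth:.1f}x with no sign-off")
```

The check does not say the change was wrong, only that nobody put their name on it.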

flowchart TD
    Start[Program adopted] --> Year1["Year 1: alerts wired,<br/>dashboards built,<br/>leadership engaged"]
    Year1 --> Breach{First breach}
    Breach -->|Lands| Real[Real program]
    Breach -->|Dodged| Decay[Decay begins]
    Decay --> Lower[SLO silently lowered]
    Decay --> Mute[Alerts muted]
    Decay --> Drift[Ownership drifts]
    Lower --> Dashboard["Availability dashboard<br/>with extra steps"]
    Mute --> Dashboard
    Drift --> Dashboard
    Real --> Quarterly["Quarterly negotiation,<br/>consequences land,<br/>roadmap shifts"]

Figure 1. The fork is the first breach. The org that holds the consequence becomes a real program. The org that dodges it begins the slide toward an availability dashboard with the appearance of a budget.

Where programs live or die

Most organizations claim to want reliability and behave like they want features. The error budget program forces that contradiction into the open, which is why the program is often quietly defanged before it does the uncomfortable work.

The question that has to be answered before the first breach: what is your organization willing to pay for reliability? Pay in feature deferral. Pay in headcount. Pay in telling a customer you’ll ship that feature later because you need stabilization time. If the answer is “we want reliability without paying anything for it,” the budget program will fail in a predictable way. Better to know that going in.

Error budgets work when the cost of breach is felt by the people making the trade. If the product team gets credit for shipped features and no penalty for budget exhaustion, the trade is rigged. If the platform team gets blamed for outages with no credit for the freezes that prevent them, it’s rigged the other way. The finance partner has to be in the room. Without finance there, the conversation upward never connects the technical mechanism to the business consequence.

That’s also the hardest part to assess from the outside. You can have the right conversations and still not know whether the commitment will hold until the moment it’s tested. The mid-size SaaS company whose freeze held got there because the head of engineering and the CFO had agreed, before the year started, that reliability would count toward the bonus pool. Most programs don’t do that work, and the freeze gets negotiated away the second time it comes up.

The thing the dashboard isn’t

Error budgets are a tool for honesty about priorities. They work in organizations willing to be honest about those priorities, a smaller group than the one currently claiming to do error budget management.

If your organization has error budgets but has never slowed shipping when one was breached, you don’t have error budget management. You have an availability dashboard with an extra calculation. The conversation about what to do when the budget runs out is the work. It has to happen before the first breach, in calm conditions, with leadership backing. After the breach, every incentive in the org pushes toward the path that quietly dissolves the program.

That platform director knew what she was looking at. Whether her org was willing to do the work that follows is a different question, and the answer wasn’t going to be settled in a meeting. It would be settled by the next breach, and by what she, her CFO, and her CEO were willing to commit to before it happened. That’s where this work lives: not in the dashboard but in the room where the SLO and the consequence get negotiated together, in advance, with the budget treated as a real promise, not a number to lower when it’s inconvenient.
