The 100% Reliability Targets Fallacy
A new CTO at a SaaS company, one that was building toward a Series B and trying to demonstrate enterprise-grade reliability, put it to me directly in a standup. “Our customers expect the system to never go down. Our SLO has to reflect that. We’re targeting 100%.” I could see the faces of a couple of engineers on the call change. None of them spoke. Within six months, their SRE program had quietly stopped using SLOs at all. The target everyone knew was meaningless had poisoned the targets that weren’t.
That’s the cost of the 100% conversation. When the headline number is a fiction, the supporting math becomes theatre, the error budget becomes a polite suggestion, and the engineering judgment that keeps the system available has nowhere to land.
[Diagram: three arguments feed one conclusion. The impossibility argument: non-zero failure probability. The economic argument: cost grows exponentially; user benefit plateaus past 4-5 nines. The program argument: zero error budget hollows out the SRE program. Conclusion: a 100% SLO is logically indefensible.]
Figure 1. Three independent arguments converge on the same conclusion. Any one of them is enough. The impossibility argument is the shortest. The program argument is the most expensive to learn by experience.
The asymptote of nines
Reliability is described in nines. Two nines is 99% available. Three is 99.9%. The pattern continues, and what it costs to chase each additional nine doesn’t scale linearly.
| Reliability | Nines | Downtime per year | Relative engineering effort |
|---|---|---|---|
| 90% | 1 | 36d 12h | 1x |
| 99% | 2 | 3d 15h | 10x |
| 99.9% | 3 | 8h 46m | 100x |
| 99.99% | 4 | 53m | 1,000x |
| 99.999% | 5 | 5m | 10,000x |
| 99.9999% | 6 | 31s | 100,000x |
| 100% | infinite | 0s | infinite |
The effort multipliers are approximate. The shape of the curve isn’t. Each additional nine costs roughly an order of magnitude more engineering effort than the previous one. The relationship is exponential in the number of nines, not linear.
- Premise one: engineering effort grows exponentially with each additional nine.
- Premise two: every organization’s resources are finite.
- Conclusion: at some point, additional nines stop being affordable. The question isn’t whether the cap exists. It’s where it sits.
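The downtime column in the table is plain arithmetic, and it can be reproduced in a few lines. What follows is a minimal sketch, not a measurement: the function names are mine, the year is taken as 365 days, and the 10x effort factor is the approximate multiplier from the table, passed in as a parameter so it can be challenged.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, matching the table


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for a target with the given number of nines."""
    unavailability = 10.0 ** -nines  # 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


def relative_effort(nines: int, factor: float = 10.0) -> float:
    """Effort relative to one nine, assuming each extra nine costs `factor` times more."""
    return factor ** (nines - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        hours = downtime_seconds_per_year(n) / 3600
        print(f"{n} nines: {hours:10.4f} h/year, effort ~{relative_effort(n):,.0f}x")
```

Changing `factor` to 5 or 20 moves the numbers but not the shape: downtime shrinks by a constant ratio per nine while effort grows by one.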
The user-experience curve
The cost curve is half the picture. The other half is the perceived benefit. A move from 90% to 99% reduces annual downtime from about 36.5 days to about 3.7 days, and users notice that. A move from 99.999% to 99.9999% reduces it from about 5 minutes a year to about 31 seconds. Users cannot notice that. The variance in their network connection, their device, the third-party services they depend on alongside yours: all of it dwarfs the difference.
- Premise one: perceptible improvement diminishes with each additional nine.
- Premise two: at some point, typically beyond 4 or 5 nines, the difference becomes imperceptible.
- Conclusion: beyond a certain threshold, additional reliability provides no measurable value to users.
The threshold isn’t universal. Voice systems with hard latency budgets care about more nines than batch reporting systems do. Trading systems care more than internal dashboards. The point is that the threshold is a function of the workload, not a number that should default upward indefinitely.
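One way to see why the sixth nine disappears from the user's point of view is to multiply the service's availability by the availability of everything else on the user's path. This is a rough sketch under two assumptions that are mine, not measured: failures are independent, and the user's own path (network, device, third parties) is available 99.9% of the time.

```python
def perceived_availability(service: float, client_path: float) -> float:
    """Availability the user experiences when the service and the user's own
    path must both be working, assuming the two fail independently."""
    return service * client_path


CLIENT_PATH = 0.999  # hypothetical figure for the user's network and device

five_nines = perceived_availability(0.99999, CLIENT_PATH)
six_nines = perceived_availability(0.999999, CLIENT_PATH)

if __name__ == "__main__":
    print(f"service at five nines: user sees {five_nines:.6%}")
    print(f"service at six nines:  user sees {six_nines:.6%}")
    print(f"gain from the extra nine: {six_nines - five_nines:.6%}")
```

Under those assumptions both results sit just below the client path's own 99.9%, and the extra nine moves the perceived figure by less than a thousandth of a percentage point.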
Why 100% is impossible, not just hard
The asymptote argument settles “very hard.” It doesn’t settle “impossible.” The impossibility argument is shorter and harder to dispute.
- Premise one: every system exists in a physical universe with non-zero probabilities of component failure.
- Premise two: every system depends on hardware, networks, power, and external services with non-zero failure rates.
- Premise three: when every component is required and failures are independent, the probability that the whole system is working equals the product of the probabilities that each component is working.
- Conclusion: any real-world system has a non-zero probability of failure at any given moment.
This isn’t economics or engineering. It’s a straightforward deduction from the premises. Given a probabilistic universe, perfect reliability requires violating physical law.
Hardware faults occur. Software has bugs the test suite didn’t catch. Operators make mistakes under pressure. Upstream dependencies fail in ways their own SLOs didn’t predict. Combinations of failures occur that nobody modeled, because nobody can model the full combinatorial space of a system at production scale.
Redundancy reduces the probability but doesn’t drive it to zero. Two redundant components, each with 99.99% reliability and failing independently, give you roughly 99.999999% combined, which is impressive and still not 100%. The math forbids the destination.
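Both the product rule and the redundancy calculation fit in a short sketch. The function names are mine, and the whole thing assumes independent failures, which real systems only approximate. It tracks failure probability rather than availability because, in floating point, an availability extremely close to 1 rounds to exactly 1.0 and would hide the point.

```python
import math
from typing import Iterable


def series_unavailability(availabilities: Iterable[float]) -> float:
    """Failure probability when every component must work (independent failures)."""
    return 1.0 - math.prod(availabilities)


def redundant_unavailability(availability: float, copies: int) -> float:
    """Failure probability when any one of `copies` identical replicas is enough."""
    return (1.0 - availability) ** copies


pair = redundant_unavailability(0.9999, copies=2)
chain = series_unavailability([0.9999] * 10)

if __name__ == "__main__":
    print(f"two replicas at 99.99%: failure probability {pair:.2e}")
    print(f"ten 99.99% components in series: failure probability {chain:.2e}")
```

Adding replicas shrinks the failure probability geometrically, but the result stays strictly positive for any finite number of copies. Chaining required components does the opposite: ten four-nines components in series land close to three nines.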
The economic argument, even if physics didn’t apply
Set the impossibility aside. Pretend 100% reliability were physically achievable. The economic argument still rules it out.
- Premise one: engineering resources have opportunity costs. Time spent on the next nine is time not spent on features, security, performance, or paying down architectural debt.
- Premise two: each additional nine costs roughly 10x the previous one.
- Premise three: user-perceived benefit diminishes at every step.
- Conclusion: at some point, the marginal cost of additional reliability exceeds its marginal benefit.
Where that point sits is specific to the system, the user base, the regulatory environment, and the alternative uses of engineering time. For most products it’s between 99.9% and 99.99%. For truly critical systems, between 99.99% and 99.999%. Beyond 99.999%, the only honest argument is regulatory: a contract or a law requires it. Even then, the cost should be visible to the people setting the requirement.
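The crossover can be sketched numerically. Everything below is illustrative: the cost of a downtime minute and the cost of the first nine are hypothetical inputs a team would have to estimate for itself, and the 10x factor is the rough multiplier from the premises above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes(nines: int) -> float:
    """Allowed downtime per year at the given number of nines (0 nines = always down)."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


def marginal_benefit(nines: int, cost_per_downtime_minute: float) -> float:
    """Yearly value of the downtime avoided by moving from nines-1 to nines."""
    saved = downtime_minutes(nines - 1) - downtime_minutes(nines)
    return saved * cost_per_downtime_minute


def marginal_cost(nines: int, first_nine_cost: float, factor: float = 10.0) -> float:
    """Yearly cost of this nine, assuming each one costs `factor` times the previous."""
    return first_nine_cost * factor ** (nines - 1)


def last_worthwhile_nine(cost_per_downtime_minute: float,
                         first_nine_cost: float,
                         max_nines: int = 9) -> int:
    """Highest number of nines whose marginal benefit still covers its marginal cost."""
    best = 0
    for n in range(1, max_nines + 1):
        if marginal_benefit(n, cost_per_downtime_minute) < marginal_cost(n, first_nine_cost):
            break
        best = n
    return best


if __name__ == "__main__":
    # Hypothetical inputs: 500 per downtime minute, 2,000 per year for the first nine.
    print(last_worthwhile_nine(500.0, 2_000.0))
```

Because benefit falls about tenfold per nine while cost rises about tenfold, the ratio between them drops roughly a hundredfold per step. That is why making downtime 100 times more expensive buys only one more nine in this model.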
What 100% does to the SRE program
The damage isn’t just wasted engineering effort. The deeper damage is what 100% does to the rest of the reliability program.
A 100% SLO implies a zero error budget. A zero error budget means every change is a risk against the SLO, every experiment is a transgression, every deployment becomes adversarial to reliability. The team can’t ship safely, can’t run chaos exercises, can’t take the calculated risks that build durable reliability into a system over time. Innovation becomes guilty.
A 100% SLO also makes the SLO unenforceable. The first incident burns the whole budget instantly. The dashboard goes red and stays red. The signal that was supposed to drive feature-versus-reliability tradeoffs becomes constant alarm noise, and the team learns to ignore it. The most expensive failure mode in an error budget program isn’t running out of budget. It’s the team agreeing, implicitly, that the budget number wasn’t real.
That SaaS company I mentioned went through both of these. The 100% target stayed on the slide. The team running operations learned to manage outages without referencing it. Inside two quarters, the SLO conversations and the reliability conversations were happening in different meetings, with different numbers, and nobody was reconciling them. The headline target had hollowed out the program that was supposed to enforce it.
Google’s Site Reliability Engineering book makes this point plainly: target reliability based on user need and business requirement, not on technical perfection. The recommendation is structural, not stylistic. Targets that exceed what users need create incentives that work against the goals the targets were meant to serve.
Setting a target that holds up
The question that should replace “what’s our SLO” is “what’s the smallest amount of reliability our users will tolerate, and at what cost.” The pattern that holds up across the teams I’ve seen get this right is a conversation with three honest parts.
First, the user-need question. What level of unavailability would push the user to a competitor, a complaint, or a contract dispute?
| Domain | Typical SLO range | What failure costs |
|---|---|---|
| Internal tooling | 99% – 99.5% | Engineer frustration |
| Consumer products | 99.5% – 99.9% | User churn |
| B2B SaaS | 99.9% – 99.99% | Contract risk |
| Payment rails | 99.99% – 99.999% | Revenue loss, regulatory |
| Clinical systems | 99.99% – 99.999% | Patient safety |
Second, the cost question. What does the next nine require in engineering effort, infrastructure spend, vendor commitments, and on-call burden? The honest answer is rarely smaller than 5x the previous nine. The team that can produce a credible cost estimate is the team that can have a real conversation with finance about what reliability is worth.
Third, the tradeoff question. What gets cut to fund the next nine? Features that drive revenue. Security work. Architectural cleanup. Performance investment. The next nine isn’t paid for by hope. It’s paid for by something specific. Naming the something is the discipline. I’ll admit this part rarely lands cleanly in a single meeting, because the people who want the nines and the people who fund the tradeoffs are often not in the same room.
The math that comes out of those three questions is straightforward to express:
```yaml
# Realistic SLO: the math works
slo:
  service: payment-api
  target: 99.99%
  error_budget: 0.01%                  # 53 minutes per year
  burn_rate_1h_threshold: 14.4         # pages if sustained
  burn_rate_6h_threshold: 6            # pages if sustained
  budget_exhaustion_at_14x: "~2 days"  # assuming a 30-day window: 30 / 14.4
  budget_drives: "deploy freeze, feature hold, reliability sprint"
```

```yaml
# 100% SLO: the math collapses
slo:
  service: payment-api
  target: 100%
  error_budget: 0%                     # no budget exists
  burn_rate_threshold: "∞"             # any error is infinite burn
  first_incident: "budget exhausted; dashboard permanently red"
  result: "team learns to ignore the signal"
```
For most products, the math produces 99.9% to 99.99%. Three nines for the workload, four nines for the platform underneath. The error budget that follows is large enough to absorb learning, small enough to drive discipline, and honest enough that the team will use it. Teams that get the target right relative to their observability maturity end up with a budget they can defend, which is the only kind worth setting.
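The burn-rate arithmetic behind those two configurations is small enough to check by hand or in code. This is a sketch with names of my own choosing, and it assumes a 30-day rolling window, which the thresholds of 14.4 and 6 are commonly paired with.

```python
import math


def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of full outage the SLO allows inside the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo
    if budget <= 0.0:
        # A 100% SLO has no budget: any error at all is infinite burn.
        return math.inf if observed_error_ratio > 0.0 else 0.0
    return observed_error_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate holds steady."""
    if rate <= 0.0:
        return math.inf
    return window_days / rate


if __name__ == "__main__":
    print(f"99.99% budget: {error_budget_minutes(0.9999):.2f} min per 30 days")
    print(f"at 14.4x burn: {days_to_exhaustion(14.4):.2f} days to exhaustion")
    print(f"at 6x burn:    {days_to_exhaustion(6.0):.2f} days to exhaustion")
    print(f"100% SLO, one bad request in a million: burn rate {burn_rate(1e-6, 1.0)}")
```

With a real budget, the burn rate is a finite number that can be compared against a paging threshold. With a 100% target, the same function can only return zero or infinity, which is the collapse the second configuration describes.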
The honest version of “we want it always up”
The CTO who asked for 100% wasn’t wrong about what his customers wanted. He was wrong about what his customers’ wants implied for the engineering target. Customers want a system that’s always up. They also want features that are always shipping, costs that are always falling, and security that’s always improving. No engineering organization delivers all four simultaneously. The job of the SLO is to make the tradeoff explicit instead of pretending it doesn’t exist.
Once the impossibility argument landed, the question changed. Not “how do we get to 100%,” but “what do our customers expect.” The answer turned out to be in the 99.9s for the consumer-facing site and a nine higher for the payment rails. Both numbers were achievable. Both had costs the team could estimate. Both produced error budgets large enough to be useful. The SLO program restarted a few months later with targets the team would hold themselves to.