SLO Targets and Observability Maturity

Mon, Mar 18, 2024

A SaaS company I worked with reported a high-99s availability SLO to its board for two years. The number drove planning and reassured a regulator. An internal review then found that the measurement excluded a sizeable share of traffic, which was traversing an unmeasured legacy proxy. The actual figure was meaningfully lower, in the 99.6% range. Nobody had been deceiving anyone. The dashboard had been built before the proxy existed, and nobody had updated the measurement boundary when the architecture changed.

The textbook approach to SLOs assumes a system you can measure honestly. Most organizations don’t have that. They have a few teams with strong observability, several with patchy coverage, and a long tail of services with dashboards held together by good intentions. Setting uniform SLO targets across that landscape produces SLOs that are aggressive where coverage is thin, conservative where coverage is strong, and meaningless where the signal isn’t trusted.

flowchart TB
    S0["Stage 0<br/>no measurement"] --> S1["Stage 1<br/>noisy / partial"]
    S1 --> S2["Stage 2<br/>clean / no<br/>enforcement"]
    S2 --> S3["Stage 3<br/>full SLO loop"]
    style S0 fill:#fdd
    style S1 fill:#fff5e0
    style S2 fill:#eaf2fa
    style S3 fill:#eaf2fa

Figure 1. The four stages of SLO readiness. Most organizations have a mix across all four. The mistake is mandating Stage 3 SLOs across the fleet and treating Stage 0 services as policy violations, when they represent a precondition gap that has to be closed before any target is meaningful.

What SLO targets require

SLO targets need four preconditions before a number means anything. Reliable measurement at the boundary the SLO references — not “we have a metric somewhere,” but the right metric, at the right boundary, with confidence in its accuracy. Confidence that the signal isn’t distorted: a 1% sampling rate the team has forgotten about produces SLO numbers that are statistically defensible and operationally useless, and a metric that drops events when the queue is full will report perfect availability during the exact moments the system is under stress. Burn-rate alerting that fires when expected and not otherwise — loud alerts that fire constantly train the on-call to ignore them, and quiet ones that miss real events produce breaches that surface only after the fact. And a team that can act on the SLO: measurement without ownership produces dashboards, not circuit breakers on feature work.

Most teams I’ve worked with have one or two of these. Getting all four is the work that makes SLOs real.
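
The first precondition, measuring at the right boundary, is easy to underestimate until the arithmetic is written down. The sketch below blends the availability a dashboard reports with an estimate for the traffic it never sees. The function name and every figure in it are hypothetical, chosen only for illustration; none of them come from a real system.

```python
def blended_availability(measured_availability, measured_share, unmeasured_availability):
    """Availability across all traffic, weighting each slice by its share of requests."""
    if not 0.0 <= measured_share <= 1.0:
        raise ValueError("measured_share must be between 0 and 1")
    unmeasured_share = 1.0 - measured_share
    return (measured_availability * measured_share
            + unmeasured_availability * unmeasured_share)


# Hypothetical figures for illustration only.
dashboard = 0.9995    # availability reported for traffic inside the boundary
share = 0.80          # fraction of all requests the dashboard actually sees
legacy_path = 0.9900  # estimated availability of the unmeasured path

actual = blended_availability(dashboard, share, legacy_path)
print(f"dashboard: {dashboard:.2%}  all traffic: {actual:.2%}")
```

With these made-up inputs the dashboard reads 99.95% while availability across all traffic works out to roughly 99.76%. The gap is invisible from inside the measured slice, which is why the boundary has to be checked separately from the number.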

The maturity stages, and the target that fits each one

The mistake is treating SLO maturity as binary. There are four stages, and the right target depends on which one a service is in.

Stage	What it looks like	Right target	Common failure
0	No measurement at the relevant boundary	Don’t set an SLO; build measurement first	Setting any target here produces decorative compliance
1	Measurement exists but is noisy, partial, or unowned	Directional target that drives coverage, not enforcement	Alerting on a Stage 1 signal burns out on-call within months
2	Clean measurement, no error budget discipline	Conservative target; build the discipline without firefighting	Skipping the conservative phase and jumping to 99.9% too early
3	Full loop: measurement, alerting, error budget, consequences	Tight target matching actual user expectation	Keeping Stage 3 targets static as the system scales

Stage 0 is more common than leadership thinks, especially in services that predate the SLO program. Stage 1 is where most “we have an SLO” services live: the metric is plumbed but skips error categories, or depends on a logging pipeline that drops events under load. The dashboard shows a number nobody can fully defend.

Stage 2 is clean measurement without a control loop: the SLO is a thermometer, not a circuit breaker on feature work. Stage 3 is the full loop: measurement, error budget, burn-rate alerting, and consequences when the budget burns. Most organizations have very few services at Stage 3. The ones they do have are usually the ones leadership talks about most.

The mapping isn’t permanent. Services move up the stages. The targets should move with them.

The burn-rate alert that makes Stage 3 real

A burn-rate alert is the enforcement mechanism separating Stage 2 from Stage 3. Without it, the SLO is a number on a dashboard with no circuit breaker on feature work. The example below fires when the error budget is burning at 14x the sustainable rate over the most recent hour, giving the on-call time to respond before the monthly budget is exhausted.

# alerting/burn-rate-slo.yaml
# Prometheus alerting rule for a 99.9% monthly availability SLO.
# Fires when 1-hour burn rate exceeds 14x the budget rate,
# meaning the entire monthly budget would exhaust in ~50 hours.
groups:
  - name: slo.order-service
    rules:
      - alert: OrderServiceHighBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="order-service"}[1h]))
          ) > (14 * (1 - 0.999))
        for: 2m
        labels:
          severity: page
          slo: "order-service-availability"
        annotations:
          summary: "Order service burning error budget at >14x rate"
          description: >
            Current error rate {{ $value | humanizePercentage }} exceeds
            the 14x burn-rate threshold for the 99.9% monthly SLO.
            At this rate the full monthly budget exhausts in ~50 hours.
          runbook_url: "https://wiki.example.com/runbooks/order-service-slo"

      - alert: OrderServiceMediumBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="order-service",code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="order-service"}[6h]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: ticket
          slo: "order-service-availability"
        annotations:
          summary: "Order service slow burn — investigate before standup"
          description: >
            6-hour error rate {{ $value | humanizePercentage }} exceeds the
            6x burn-rate threshold. Budget not critical yet, but trending toward breach.
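
The thresholds in the two rules above follow from simple arithmetic, and checking it by hand is a useful sanity test before paging anyone. The sketch below assumes a 30-day SLO window (the rule only says "monthly"); the function and key names are hypothetical.

```python
import math

WINDOW_HOURS = 30 * 24  # assumption: "monthly" means a 30-day window


def burn_rate_summary(slo_target, burn_rate, alert_window_hours):
    """Derive alert threshold and budget figures for one burn-rate rule."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return {
        # Error ratio at which the rule's expression becomes true.
        "threshold_error_ratio": burn_rate * budget,
        # Time until the whole budget is gone if the burn continues.
        "hours_to_exhaustion": WINDOW_HOURS / burn_rate,
        # Share of the monthly budget consumed during one alert window.
        "budget_spent_in_window": burn_rate * alert_window_hours / WINDOW_HOURS,
    }


fast = burn_rate_summary(0.999, burn_rate=14, alert_window_hours=1)
slow = burn_rate_summary(0.999, burn_rate=6, alert_window_hours=6)

for name, s in (("fast burn", fast), ("slow burn", slow)):
    print(
        f"{name}: threshold {s['threshold_error_ratio']:.2%}, "
        f"exhausts in {s['hours_to_exhaustion']:.0f}h, "
        f"{s['budget_spent_in_window']:.1%} of budget per window"
    )
```

Under that assumption the 14x rule trips at a 1.4% error ratio and would exhaust the budget in about 51 hours, which the rule comments round to ~50. The 6x rule trips at 0.6% and leaves about five days, which is why it opens a ticket rather than paging.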

What happens when you ignore the stages

A mid-size SaaS, one that had recently scaled from a handful of services to several dozen, mandated 99.9% SLOs across all services. Within a quarter, two patterns emerged. Services with strong observability were hitting their targets comfortably, with no pressure to improve. Services with weak observability were missing their targets in ways the team couldn’t explain, because the misses were measurement artifacts, not reliability problems.

The platform team responded by tightening alerting thresholds, assuming the services that were missing their targets needed more attention. The on-call rotation for those services started getting paged at all hours for alerts the team couldn’t act on. Two senior engineers left within six months. The SLO program got blamed. The cause was the mandate to apply Stage 3 discipline to Stage 1 services.

The fix was to map every service to a stage explicitly, set targets per stage with a 12-month roadmap to move services up, and accept that some services would never reach Stage 3 because the cost exceeded the value. The handful of services that moved from Stage 1 to Stage 3 over the following quarters absorbed most of the platform team’s investment. A few services stayed at Stage 1 deliberately, with directional targets, because the user impact didn’t justify the investment to move them.

The pattern that holds up is the stage-aware version: messier than the uniform version, but it produces measurements teams trust.

The cross-team conversation

The platform team that wants uniform targets and the product teams with wildly different maturity have a conversation that almost always goes the same way. Platform proposes a uniform standard. Product teams object that their service can’t honestly hit it. Platform responds that the standard is the standard. Product teams comply on paper, set up the SLO with whatever measurement they have, and quietly stop trusting the result.

The honest negotiation is targets per service with a path to maturity, not enforced averages across the fleet. “We have a dozen services at Stage 3, nearly twenty at Stage 2, a handful at Stage 1, and a few at Stage 0. The Stage 1 services have a roadmap to Stage 2 by Q2.” That’s a useful report. “We have 99.94% availability across the platform” is a number that, in this context, hides more than it reveals.
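
A per-stage report like the one quoted above is cheap to produce once every service has an explicit stage. The sketch below assumes a simple mapping from service name to stage; the inventory is hypothetical and the service names are placeholders.

```python
from collections import Counter

# Hypothetical inventory: service name -> maturity stage (0 to 3).
SERVICE_STAGES = {
    "checkout": 3,
    "order-service": 3,
    "search": 2,
    "notifications": 2,
    "billing-export": 1,
    "legacy-proxy": 0,
}


def stage_report(service_stages):
    """Count services per stage, including stages that have no services."""
    for name, stage in service_stages.items():
        if stage not in (0, 1, 2, 3):
            raise ValueError(f"{name}: stage must be 0-3, got {stage!r}")
    counts = Counter(service_stages.values())
    return {stage: counts.get(stage, 0) for stage in range(4)}


def format_report(service_stages):
    """Render one line per stage, naming the services so nothing hides in a total."""
    lines = []
    for stage, count in stage_report(service_stages).items():
        names = sorted(n for n, s in service_stages.items() if s == stage)
        lines.append(f"Stage {stage}: {count} ({', '.join(names) or 'none'})")
    return "\n".join(lines)


print(format_report(SERVICE_STAGES))
```

Listing the service names next to each count is a deliberate choice: the report stays useful only if readers can see which services sit at Stage 0 and Stage 1, not just how many.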

The temptation to mandate uniformity is strong because uniformity reads better in a slide deck. The cost shows up as dashboards nobody trusts and on-call burnout that follows from acting on signals that aren’t real.

The decision flow for a new SLO

flowchart TD
    Start["New SLO request"] --> Boundary{"Measurement<br/>boundary<br/>defined?"}
    Boundary -->|No| BuildM["Build measurement<br/>first"]
    Boundary -->|Yes| Signal{"Signal<br/>trustworthy?"}
    Signal -->|No| FixS["Fix sampling,<br/>gaps, drops"]
    Signal -->|Yes| Alert{"Burn-rate<br/>alerting in<br/>place?"}
    Alert -->|No| Stage2["Set conservative<br/>Stage 2 target"]
    Alert -->|Yes| Owner{"Team can<br/>act on breach?"}
    Owner -->|No| Ownership["Establish<br/>ownership first"]
    Owner -->|Yes| Stage3["Set Stage 3<br/>target"]
    style Start fill:#eaf2fa
    style Stage3 fill:#eaf2fa
    style Stage2 fill:#fff5e0
    style BuildM fill:#fdd
    style FixS fill:#fdd
    style Ownership fill:#fdd

Figure 2. The four preconditions as a decision flow. Skipping a step doesn’t remove the requirement; it defers it until an incident surfaces it. The most expensive path through this diagram is the one where teams set Stage 3 targets at the top and discover the missing preconditions through alert fatigue and engineer attrition.
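
The decision flow in Figure 2 is small enough to encode directly, for example as a check in an SLO intake form. This is a minimal sketch that mirrors the diagram's branches in order; the class and function names are hypothetical, and the four booleans are assumed to come from an honest review rather than self-reporting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReadiness:
    """Answers to the four questions in the decision flow."""
    boundary_defined: bool
    signal_trustworthy: bool
    burn_rate_alerting: bool
    team_can_act: bool


def next_step(readiness: SloReadiness) -> str:
    """Return the first unmet precondition's action, or the Stage 3 outcome."""
    if not readiness.boundary_defined:
        return "Build measurement first"
    if not readiness.signal_trustworthy:
        return "Fix sampling, gaps, drops"
    if not readiness.burn_rate_alerting:
        return "Set conservative Stage 2 target"
    if not readiness.team_can_act:
        return "Establish ownership first"
    return "Set Stage 3 target"


# A service with clean measurement but no alerting yet.
print(next_step(SloReadiness(True, True, False, False)))
```

The checks run in the same order as the diagram, so a service failing several preconditions is told about the earliest one only. That matches the point of the figure: later steps are not meaningful until the earlier ones hold.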

The common failure modes

Four failure modes come up consistently. SLOs that exist on paper because leadership wanted SLOs: the deck has them, the on-call doesn’t reference them, the error budget never triggers anything. Targets set by aspiration rather than capability: the SLO is 99.99% because the customer asked for it, the system was built for 99.5%, and the targets get missed for reasons that have nothing to do with reliability. Coverage gaps hidden by aggregating across services with different maturity levels, where the aggregate looks fine while hiding which services are real risks. And the “we’re at 99.9%” that means “99.9% of the requests we measured” — the SaaS company I opened with. The measurement excluded a sizeable share of the traffic. The first time anyone looked carefully at the boundary, the SLO program lost its credibility for a year.

The question worth asking before setting any target

Before you set an SLO target on a service, ask whether the observability infrastructure underneath it can support the measurement honestly. If it can’t, the target is decoration, and the cost of pretending otherwise compounds.

The honest answer is sometimes that the work to level up observability is the precondition, not a parallel track. That’s a hard message when leadership has already been told the SLO program is launching this quarter. It’s also the message that produces SLOs that mean something a year later, instead of a dashboard nobody trusts and a roadmap full of work to fix the program instead of the systems.

The SaaS company’s board eventually got an updated number, with a clear methodology, a defined measurement boundary, and a roadmap for the unmeasured proxy. The number was lower than the original. The board’s response was to ask better questions about reliability than they had been asking. An honest measurement produced honest conversations. The decoration version had been preventing them.
