Shaping Architecture With SLOs

Tue, Mar 4, 2025

When did an SLO last shape a roadmap decision at your company? Not appear in a quarterly review, not show up on a board deck, not get cited in an all-hands. Shape what got built, what got deferred, or what got rebuilt. If the answer takes a long pause, the SLO program is probably a dashboard.

SLOs went from internal Google vocabulary to industry standard quickly, and the vocabulary spread faster than the discipline. Too often the SLO is treated as a measurement applied to the architecture, when it should be the input that determines the architecture. The fix is to push the SLO upstream of the design.

What the SLO is supposed to decide

If your SLO is doing real work, it’s making four kinds of decisions, and you can point to instances of each one in the last twelve months.

The first is where you invest in reliability, and where you don’t. Not every workload needs four nines. The internal scheduler that runs nightly batches has a different reliability budget than the customer-facing checkout flow, and the SLO is the input that distinguishes them.

The second is when to ship and when to stabilize. The error budget conversation lives downstream of this. The upstream version is “what does the SLO suggest this team should be doing this quarter?” If the answer is the same regardless of SLO state, the SLO isn’t doing the work.

The third is how much capacity, redundancy, and operational headroom to build in. Three nines and four nines aren’t different alerts. They’re different architectures, with different cost curves and different operational complexity. The SLO has to be selected before the architecture, not after.
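
To make the gap between three and four nines concrete, the arithmetic can be sketched as follows. This is a minimal illustration, assuming a 30-day window and a purely time-based notion of availability; the function name `allowed_downtime_minutes` is invented for this example.

```python
# Sketch: downtime tolerated by an availability target over a rolling window.
# Assumes a 30-day window and time-based availability; names are illustrative.

WINDOW_DAYS = 30


def allowed_downtime_minutes(target_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Return the minutes of full unavailability the target tolerates in the window."""
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100, exclusive")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min per {WINDOW_DAYS} days")
```

Under these assumptions, 99.9% leaves roughly 43 minutes a month and 99.99% roughly 4. A window that short is unlikely to fit human-driven recovery, which is one reason the tighter target tends to imply a different architecture rather than a stricter alert.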

The fourth is which workloads warrant which patterns, and which don’t. Multi-region active-active is a real cost. Strong consistency is a real cost. The pattern that’s right for the SLO that needs it is the wrong pattern for the SLO that doesn’t.

What an SLO worth setting looks like

A SaaS company I worked with had a platform team that had done the hard work of instrumenting their stack before most teams bothered. They had a couple of dozen SLOs, most set on internal signals: queue depth, pod restart rates, and P99 latency on internal services. None of those map cleanly to user impact. Over a quarter, we pruned them down to seven user-visible SLOs. The rest became internal signals: still measured, no longer treated as objectives. The team’s energy was redirected to the seven that mattered.

SLOs worth setting share a small number of properties. They’re user-visible, measured at the boundary between the user and the system: the thing the user notices is the thing the SLO measures. Internal signals are fine to track; they aren’t SLOs. Good SLOs are also tied to a user impact you can name: the user can complete checkout, the user can search and get results, the user receives the notification within the window the product promised. If you can’t articulate the user-facing thing the SLO protects, the SLO isn’t a commitment; it’s a measurement.

They’re also aggressive enough to matter but conservative enough to be honest about what the architecture can support. Most teams set SLOs too tight, which trains people to ignore them. A few set them too loose, which makes them ceremonial. The honest target is often a number that makes the team uncomfortable, because it forces a choice between investing in reliability or accepting a public commitment to less of it. The number that’s right is probably smaller than the number you have.
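
The first two properties lend themselves to a mechanical first pass over an existing SLO inventory. The sketch below is one possible way to do that pruning; the `Candidate` fields and the sample entries are hypothetical, and the judgment about whether a user impact is real still belongs to people.

```python
# Sketch: separate SLO candidates from internal signals.
# The fields and sample entries are hypothetical, not from a real inventory.
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    measured_at_user_boundary: bool
    user_impact: str  # empty if nobody can name the user-facing effect


def is_slo_worthy(c: Candidate) -> bool:
    """An objective must be user-visible and tied to a nameable user impact."""
    return c.measured_at_user_boundary and bool(c.user_impact.strip())


def prune(candidates: list[Candidate]) -> tuple[list[str], list[str]]:
    """Return (SLO names, internal-signal names).

    Internal signals stay measured; they just stop being objectives.
    """
    slos = [c.name for c in candidates if is_slo_worthy(c)]
    signals = [c.name for c in candidates if not is_slo_worthy(c)]
    return slos, signals


if __name__ == "__main__":
    inventory = [
        Candidate("checkout success rate", True, "the user can complete checkout"),
        Candidate("queue depth", False, ""),
        Candidate("pod restart rate", False, ""),
        Candidate("internal service P99 latency", False, ""),
    ]
    slos, signals = prune(inventory)
    print("SLOs:", slos)
    print("Internal signals:", signals)
```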

SLO properties compared by readiness level

The table below captures the difference between an SLO that’s doing the work and one that exists in name only. Teams often have all the infrastructure for the Dashboard SLO column and none of the organizational muscle for the Working SLO column.

Property	Dashboard SLO	Working SLO
Signal source	Internal metrics	User-visible boundary
Owner	Tool or team	Named individual
Consequence	None defined	Documented feature deferral or freeze
Review cadence	Ad-hoc or never	Quarterly with decision minutes
Architecture influence	None	Has blocked or redirected work

The architecture the SLO is asking for

Once an SLO is real, the architecture starts answering specific questions. Multi-region or single-region. Aggressive autoscaling or steady-state with headroom. Strong consistency or eventual consistency. Investment in failure domains or acceptance of the blast radius you have. Each of these is a cost decision and an SLO decision at the same time.

flowchart TD
    SLO[SLO target<br/>user-facing reliability<br/>commitment] --> Topology{Topology}
    SLO --> Capacity{Capacity model}
    SLO --> Consistency{Consistency model}
    SLO --> Failure{Failure domains}
    Topology --> Multi[Multi-region<br/>active-active]
    Topology --> Single[Single region with DR]
    Capacity --> Auto[Aggressive autoscaling]
    Capacity --> Steady[Steady-state<br/>with headroom]
    Consistency --> Strong[Strong consistency]
    Consistency --> Eventual[Eventual consistency]
    Failure --> Cell[Cell isolation]
    Failure --> Shared[Shared infrastructure]
    style SLO fill:#eaf2fa
    style Multi fill:#fff5e0
    style Strong fill:#fff5e0
    style Cell fill:#eaf2fa

Figure 1. The architecture lattice an SLO walks the team through. Each branch is a cost decision and an SLO decision at the same time. Teams that pick branches without naming the SLO that justifies them end up over-investing on workloads that don’t need it and under-investing on workloads that do.

A team I advised, one managing a growing portfolio of internal tools and trying to match the reliability posture of their customer-facing services, had built multi-region active-active for an internal admin dashboard: two engineers and a quarter of work. The dashboard had no user-facing SLO that justified it. They’d built the topology because the team that built the customer-facing service had built it, and the pattern propagated without the SLO question being asked. We pulled it back to single region with a clear DR posture. The freed capacity went to a different system that did have a user-facing SLO and was under-resourced for it.

The pattern generalizes. Architecture without SLOs as the input drifts toward the most reliable pattern available, regardless of whether the workload needs it. SLOs as the input force the conversation about which workload deserves which pattern.
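
One way to catch this drift is to audit the portfolio for expensive patterns that have no SLO behind them. The sketch below is deliberately crude: it only checks whether a user-facing SLO exists at all, not whether its target is high enough to justify the pattern. The pattern names, the `Workload` type, and the sample portfolio are hypothetical.

```python
# Sketch: flag workloads running an expensive pattern with no user-facing SLO.
# Pattern names and the sample portfolio are hypothetical.
from dataclasses import dataclass
from typing import Optional

# Patterns this article calls out as carrying a real cost.
EXPENSIVE_PATTERNS = {"multi-region active-active", "strong consistency"}


@dataclass(frozen=True)
class Workload:
    name: str
    patterns: frozenset
    user_facing_slo: Optional[float]  # availability target in percent, or None


def unjustified_patterns(w: Workload) -> set:
    """Expensive patterns adopted by a workload that has no user-facing SLO."""
    if w.user_facing_slo is not None:
        return set()
    return set(w.patterns) & EXPENSIVE_PATTERNS


if __name__ == "__main__":
    portfolio = [
        Workload("checkout", frozenset({"multi-region active-active"}), 99.9),
        Workload("admin dashboard", frozenset({"multi-region active-active"}), None),
    ]
    for w in portfolio:
        flagged = unjustified_patterns(w)
        if flagged:
            print(f"{w.name}: no SLO justifies {sorted(flagged)}")
```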

What an SLO definition looks like in practice

The following YAML encodes an SLO and its alerting rules in the Kubernetes resource format used by Sloth; other SLO-as-code tools use different schemas but can express the same idea. The key discipline here is that the SLO definition names the user-visible indicator, not an internal signal.

# slo-checkout-availability.yaml
# SLO for the checkout service, user-visible availability.
# Error budget: 0.1% over 30 days (~43 minutes).
# Burn-rate alerts at 1h and 6h windows to catch fast burns early.

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-availability
  namespace: slos
spec:
  service: "checkout"
  labels:
    owner: "platform-eng"
    tier: "user-facing"

  slos:
    - name: "availability"
      objective: 99.9
      description: >
        Users can complete a checkout request without a 5xx error.
        Measured at the load balancer boundary, not at the service level.

      sli:
        events:
          errorQuery: >
            sum(rate(http_requests_total{
              service="checkout",
              status_code=~"5.."
            }[{{.window}}]))
          totalQuery: >
            sum(rate(http_requests_total{
              service="checkout"
            }[{{.window}}]))

      alerting:
        name: CheckoutAvailabilitySLOBurn
        annotations:
          summary: "Checkout SLO burn rate too high"
          runbook: "https://runbooks.internal/checkout-availability"
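        pageAlert:
          # Fast burn. Sloth derives the windows and thresholds itself; with
          # its default 30-day windows this is roughly 2% of the budget in
          # 1 hour or 5% in 6 hours. Verify against your Sloth version.
          labels:
            severity: page
        ticketAlert:
          # Slow burn: with the same defaults, roughly 10% of the budget
          # over 1 to 3 days.
          labels:
            severity: ticket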
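
The error budget and burn-rate figures behind a definition like this come from simple arithmetic. The sketch below assumes a 30-day window and the 99.9% objective used above; the function names are illustrative and are not part of Sloth.

```python
# Sketch: error-budget and burn-rate arithmetic for an availability SLO.
# Assumes a 30-day window; function names are illustrative, not part of Sloth.

WINDOW_HOURS = 30 * 24


def error_budget_ratio(objective_percent: float) -> float:
    """Fraction of requests allowed to fail, e.g. 99.9 gives about 0.001."""
    return (100 - objective_percent) / 100


def burn_rate(observed_error_ratio: float, objective_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / error_budget_ratio(objective_percent)


def budget_spent(rate: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the whole window's budget consumed at a given burn rate."""
    return rate * hours / window_hours


if __name__ == "__main__":
    rate = burn_rate(observed_error_ratio=0.0144, objective_percent=99.9)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget spent in 1h: {budget_spent(rate, 1):.1%}")
    print(f"budget spent in 6h at 6x: {budget_spent(6, 6):.1%}")
```

A 1.44% error ratio against a 99.9% objective is a burn rate of about 14.4, which consumes about 2% of the 30-day budget in one hour. A burn rate of 6 sustained for six hours consumes about 5%.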

The error budget conversation, briefly

Error budgets are downstream of SLOs, and I’ve covered the budget mechanics elsewhere. The short version: the budget is a feature-versus-stability lever, and it works only if the org has explicit decision rights about who gets to spend it. The political problem in most organizations isn’t the math. It’s deciding whether the platform team or the product team owns the budget when they both want to spend it on different things.

The cultural change SLOs imply is the part most organizations underestimate, and the part I find hardest to convince people on, because it’s a political problem more than a technical one. The org that has been measuring availability has to start treating reliability investment as a tradeoff. That conversation is uncomfortable on both sides: product hears “we’re going to ship less,” and platform hears “we have to defend our reliability work as a discrete budget item.” The discomfort is the program working.

Ownership, consequences, and the negotiation

The SLO that doesn’t have an owner is decoration. The owner is the person who has to defend it, can call the conversation when it’s at risk, and is accountable when the architecture isn’t supporting it. If you can’t name them, the SLO isn’t real.

The SLO without consequences is a dashboard. The consequence doesn’t have to be dramatic. It can be a documented feature deferral. It can be a release freeze. It can be a quarterly conversation that changes the roadmap. The point is that the SLO state has to translate to a decision that wouldn’t otherwise have happened.

flowchart LR
    State[SLO state<br/>at quarter end] --> Q{Budget status?}
    Q -->|Healthy: >20% remaining| Invest[Feature work<br/>proceeds]
    Q -->|Caution: 5–20% remaining| Review[Reliability review<br/>before new features]
    Q -->|Burned: <5% remaining| Freeze[Feature freeze<br/>reliability sprint]
    Invest --> Log[Document in<br/>quarterly minutes]
    Review --> Log
    Freeze --> Log
    style Freeze fill:#fdd
    style Review fill:#fff5e0
    style Invest fill:#eaf2fa

Figure 2. An error budget decision flow. The value is not in the diagram itself but in the fact that every outcome is documented. Undocumented deferrals disappear from institutional memory within a quarter.
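
The same flow can be written down as a few lines of code, which makes the thresholds explicit and the documentation step hard to skip. This is a sketch, assuming the thresholds shown in Figure 2; the function names and the shape of the minutes entry are hypothetical.

```python
# Sketch of the Figure 2 decision flow. Thresholds come from the figure;
# the function names and the minutes format are hypothetical.
from datetime import date


def budget_decision(remaining_fraction: float) -> str:
    """Map the error budget left at quarter end to one of the three outcomes."""
    if remaining_fraction > 1.0:
        raise ValueError("remaining_fraction cannot exceed 1.0")
    if remaining_fraction > 0.20:
        return "feature work proceeds"
    if remaining_fraction >= 0.05:
        return "reliability review before new features"
    # Below 5%, including a negative value for an overspent budget.
    return "feature freeze, reliability sprint"


def record_decision(slo_name: str, remaining_fraction: float,
                    minutes: list, decided_on: date) -> dict:
    """Append the outcome to the quarterly minutes so every branch is documented."""
    entry = {
        "slo": slo_name,
        "date": decided_on.isoformat(),
        "budget_remaining": remaining_fraction,
        "decision": budget_decision(remaining_fraction),
    }
    minutes.append(entry)
    return entry


if __name__ == "__main__":
    quarterly_minutes: list = []
    record_decision("checkout-availability", 0.12, quarterly_minutes, date(2025, 3, 31))
    for entry in quarterly_minutes:
        print(entry)
```

Routing every outcome through `record_decision`, including the healthy one, mirrors the point of the figure: the log is what survives the quarter.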

The pattern that works is an SLO with an owner, a quarterly negotiation, a defined consequence, and a track record of the consequence happening. Without those four, you have an SLO program in name. With them, you have one that shapes architecture. The maturity question is rarely about the math. I’ve written about setting SLO targets when teams have inconsistent observability maturity, and the same lesson applies here: the program lives or dies on whether the organization treats it as a real input.

The honest test

SLOs are an architectural input, not an output. Treat them that way and you’ll build different systems than the org that treats them as a reporting artifact. You’ll invest where the SLO requires it, defer where it doesn’t, and have a defensible answer for both directions.

The honest test of your SLO program is whether an SLO has ever caused your team to do less, ship later, or rebuild something. If the answer is no, the SLO isn’t doing the work it’s supposed to. The dashboard is fine. The dashboard isn’t the program.

The platform VPs who take the question seriously usually come back six months later with fewer SLOs, named owners, and a quarterly negotiation that was, by their own admission, awkward the first time. By the next quarter, the negotiations are producing documented deferrals. The dashboard hasn’t changed much. The architecture has. The question is which version of that story your org is in the middle of right now.
