Monitoring at Scale

Tue, Oct 21, 2025

Metrics costs don’t scale with capacity. They scale with cardinality. A platform that grows from 50 to 500 services can see its bill grow by an order of magnitude while the underlying infrastructure has only tripled. Nobody decides this. Every team makes individually reasonable choices: a label for a customer ID here, a dimension for a feature flag there, a new service instrumented with the same conventions as the old one. The individual decisions look fine. The product of them looks like a billing shock.

The fix when this happens isn’t a discount renegotiation or a tool change. It’s admitting that the monitoring architecture built at 50 services is still running at 500, and the architecture doesn’t scale. Patterns that work at small scale aren’t bigger versions of the same patterns at large scale. They’re different patterns, with different economics, different ownership models, and different failure modes.

What changes at scale

Four shifts arrive together as a monitoring stack grows past a couple of hundred services.

Cardinality grows non-linearly. Each new service adds a few labels. Each label can take many values. A label bounded at 10 values when introduced can grow to 10,000 if it’s tied to a customer ID or a feature flag that proliferates. The product of services, dimensions, and label cardinalities determines storage and query cost. Most teams haven’t modeled the product.
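The product is easy to model. The sketch below is a minimal illustration under assumed numbers: the label names and value counts are hypothetical, and the function returns a worst-case bound, since not every label combination is emitted in practice.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-counter metric.
bounded = {"service": 1, "env": 3, "status_code": 5, "region": 4}
with_customer = {**bounded, "customer_id": 10_000}

print(series_upper_bound(bounded))        # 60
print(series_upper_bound(with_customer))  # 600000
```

One added label turns a 60-series metric into one with a 600,000-series ceiling, and that happens per metric, per service.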

The bill becomes a significant line item. At 50 services, observability cost is small enough that nobody asks. At 500, it’s often the largest line in the platform team’s budget, sometimes larger than compute. That’s the moment finance starts asking pointed questions, and the platform team has to defend a budget against architectural decisions made by other teams.

Dashboards proliferate past the number of humans who can maintain them. A team of eight can keep 20 dashboards healthy. They cannot keep 600 healthy, and they don’t try. The dashboards become a graveyard. New dashboards get added because nobody can find the existing one that does the same thing, and the cycle compounds.

Signal-to-noise ratio degrades. The alerts that mattered at 50 services drown in the alerts that fired once during a deploy three years ago and never got tuned out. On-call burns out. People build personal coping mechanisms, like filters that hide whole categories of pages. The platform’s effective alerting becomes whatever the on-call rotation has trained itself to ignore.

All four shifts are connected: each makes the others worse, and none of them are visible until they’ve been compounding for years.

What the architecture has to look like

Monitoring at scale needs four architectural decisions made deliberately. None of them are exotic. All of them have to be made early, because retrofitting them at scale is the multi-quarter remediation project nobody wants to fund.

Aggregation tiers: collect locally, aggregate regionally, query globally. The shape lets high-cardinality detail stay close to where it’s generated, with progressively coarser representations available for queries that don’t need it. Most teams never tier; they ship every metric, at full cardinality, to a central system. The bill reflects the choice.
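What happens at a tier boundary can be sketched in a few lines. This is an illustrative model, not how any particular aggregator is implemented: the `aggregate` helper, the label names, and the sample values are all assumptions, and summing is only the right operation for additive quantities such as request rates.

```python
from collections import defaultdict

# A series is identified by a sorted tuple of (label, value) pairs.
Labels = tuple[tuple[str, str], ...]


def aggregate(samples: dict[Labels, float], keep: set[str]) -> dict[Labels, float]:
    """Sum sample values after dropping every label not listed in `keep`."""
    out: dict[Labels, float] = defaultdict(float)
    for labels, value in samples.items():
        reduced = tuple(sorted((k, v) for k, v in labels if k in keep))
        out[reduced] += value
    return dict(out)


# Hypothetical per-pod request rates held by a local agent.
local = {
    (("env", "prod"), ("pod", "checkout-a"), ("service", "checkout")): 120.0,
    (("env", "prod"), ("pod", "checkout-b"), ("service", "checkout")): 80.0,
    (("env", "prod"), ("pod", "cart-a"), ("service", "cart")): 50.0,
}

# The regional tier keeps only service and env; the pod label stays local.
regional = aggregate(local, keep={"service", "env"})
print(len(local), "->", len(regional))  # 3 -> 2
```

The per-pod detail still exists at the local tier for as long as its retention allows; only the coarser series travel upward.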

Retention as a function of value, not a default. Different signals deserve different retention. The error rate of the checkout service deserves long retention because it informs trend analysis and SLO tracking. The CPU utilization of an individual pod from three months ago deserves to be aggregated, then dropped. Most teams set a default and apply it everywhere, paying to retain detail nobody queries.
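One way to make retention a function rather than a default is to write it as one. The sketch below is an assumption-heavy illustration: the `Signal` fields are hypothetical, and the day counts are placeholders apart from the 7-day window, which matches the per-pod retention used in the rule file later in this article.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Signal:
    name: str
    slo_linked: bool    # feeds an SLO or long-term trend analysis
    per_instance: bool  # carries pod- or host-level labels


def retention_for(signal: Signal) -> timedelta:
    """Pick retention from what the signal is used for, not a global default."""
    if signal.slo_linked:
        return timedelta(days=395)  # placeholder: a bit over a year of trend data
    if signal.per_instance:
        return timedelta(days=7)    # raw detail: aggregate, then drop
    return timedelta(days=90)       # placeholder middle tier


checkout_errors = Signal("checkout_error_rate", slo_linked=True, per_instance=False)
pod_cpu = Signal("pod_cpu_utilization", slo_linked=False, per_instance=True)

print(retention_for(checkout_errors).days)  # 395
print(retention_for(pod_cpu).days)          # 7
```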

Sampling as a design decision, not a reaction to a billing shock. Teams that handle scale well treat sampling as part of the design from the start, with documented rules for what gets sampled, what doesn’t, and why. Teams that don’t end up sampling reactively when the bill spikes, with no model for what they’re losing visibility into.
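Documented sampling rules can live next to the code that applies them. The following is a sketch under assumptions: the event classes, rates, and reasons are hypothetical, and hashing the event ID is one common way to make the keep-or-drop decision deterministic, not the only one.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingRule:
    rate: float  # fraction of events kept, from 0.0 to 1.0
    reason: str  # the documented "why", reviewed along with the rate


# Hypothetical rule table: what gets sampled, what does not, and why.
RULES = {
    "error": SamplingRule(1.0, "errors feed SLO tracking and are never sampled"),
    "slow": SamplingRule(0.5, "latency outliers are kept at a high rate for debugging"),
    "ok": SamplingRule(0.01, "healthy traffic is high volume and low information"),
}


def keep(event_class: str, event_id: str) -> bool:
    """Deterministic sampling: the same event ID always gets the same answer."""
    rule = RULES[event_class]
    digest = hashlib.sha256(event_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rule.rate


print(keep("error", "req-123"))  # True: the error rule keeps everything
```

Because the decision depends only on the event ID and the rule, every component that sees the same event reaches the same verdict, and a change to the rate is a reviewable diff rather than an emergency.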

The schema as a contract. Labels, dimensions, semantic conventions: enforced, not aspirational. A schema document that lives in a wiki and gets ignored is a schema that produces unbounded cardinality. The schema enforced at admission, with a real veto on emissions that violate it, is the schema that holds.
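Enforcement at admission can start as a small check in CI or in the ingestion path. The sketch below assumes a simple allowlist model; the label sets and the `schema_violations` helper are hypothetical, and a real schema would likely carry per-metric rules as well.

```python
# Hypothetical schema: labels every team may use, and labels known to be unbounded.
ALLOWED_LABELS = {"service", "env", "region", "status_code", "le"}
FORBIDDEN_LABELS = {"customer_id", "user_id", "request_id"}


def schema_violations(metric_name: str, labels: set[str]) -> list[str]:
    """Return one message per label that breaks the schema; empty means admit."""
    problems = []
    for label in sorted(labels):
        if label in FORBIDDEN_LABELS:
            problems.append(f"{metric_name}: label '{label}' is unbounded and not allowed")
        elif label not in ALLOWED_LABELS:
            problems.append(f"{metric_name}: label '{label}' is not in the schema")
    return problems


for message in schema_violations("http_requests_total", {"service", "env", "customer_id"}):
    print(message)
```

The veto is the return value: a non-empty list fails the build or drops the emission, which is what separates an enforced schema from a documented one.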

This connects directly to the broader telemetry architecture conversation, which is mostly an economics conversation: who emits what, at what cost, with what review. Monitoring at scale is the same conversation in a more specific frame.

Patterns at 50 vs. 500 services

The table below captures where the small-scale architecture assumption breaks and what the replacement looks like. None of these transitions are difficult in isolation. The difficulty is that all of them need to happen before the system has grown large enough to make retrofitting them painful.

Dimension	Works at 50 services	Breaks at 500	Replacement pattern
Cardinality	Per-team discretion	Unbounded cost growth	Budget per service, CI enforcement
Retention	Single global default	Dominant cost, low query value	Tiered by signal type and SLO linkage
Dashboards	Informal ownership	Graveyard, no canonical view	Named owner, expiry, quarterly review
Alerting	Ad hoc thresholds	Noise overwhelms signal	Quality gates: runbook, owner, tuning history
Schema	Convention by example	Divergent labels, no joins	Schema enforced at admission

flowchart LR
    Local["Local agents<br/>full cardinality<br/>short retention"] --> Regional["Regional aggregators<br/>reduced cardinality<br/>medium retention"]
    Regional --> Global["Global stores<br/>SLO and trend signals<br/>long retention"]
    style Local fill:#eaf2fa
    style Regional fill:#fff5e0
    style Global fill:#eaf2fa

Figure 1. The aggregation tiering that scales. Most teams collapse this into a single tier and pay the cost of full-cardinality retention everywhere. The remediation cost grows with the data volume already retained, so the time to design the tiering is before the volume is there.

A Prometheus recording rules example

The following rule file illustrates what “aggregation at the tier boundary” looks like in practice. Recording rules pre-compute expensive aggregations on a schedule, so dashboards and alerts query the stored result instead of the raw data. At scale, that can be the difference between a query that takes 200ms and one that takes 20 seconds. The file also carries one alerting rule, a guard on the per-service cardinality budget.

# recording_rules.yaml
# Pre-aggregated metrics for SLO dashboards and cross-service queries.
# These rules run on the regional aggregator, not the central store.
# Raw per-pod metrics are retained locally for 7 days, then dropped.

groups:
  - name: slo_aggregations
    interval: 60s
    rules:
      # Aggregate request success rate across all pods in a service
      - record: service:http_request_success:rate5m
        expr: |
          sum by (service, env) (
            rate(http_requests_total{status_code!~"5.."}[5m])
          )
          /
          sum by (service, env) (
            rate(http_requests_total[5m])
          )

      # Pre-aggregate P99 latency per service (not per pod)
      - record: service:http_request_duration_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, env, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: cardinality_guards
    interval: 300s
    rules:
      # Alert when a service's total active series count exceeds its budget.
      # The selector matches every series, so evaluation is expensive; keep the interval long.
      - alert: CardinalityBudgetExceeded
        expr: |
          count by (service) (
            {__name__=~".+"}
          ) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} exceeded cardinality budget"
          description: >
            {{ $labels.service }} has {{ $value }} active series.
            Budget is 50,000. Review label dimensions and open a cardinality
            review ticket before the next billing cycle.
          runbook: "https://runbooks.internal/cardinality-budget"

Patterns that work

A handful of patterns separate monitoring stacks that absorb growth from those that don’t.

flowchart TD
    New["New metric or label<br/>proposed by team"] --> Q1{"Cardinality<br/>bounded?"}
    Q1 -->|Yes| Q2{"Service budget<br/>has headroom?"}
    Q1 -->|No, unbounded| Reject["Reject at CI<br/>open cardinality<br/>review ticket"]
    Q2 -->|Yes| Q3{"SLO-linked<br/>signal?"}
    Q2 -->|No, over budget| Warn["Warn team<br/>review required<br/>before merge"]
    Q3 -->|Yes| LongR["Route to<br/>long retention"]
    Q3 -->|No| ShortR["Route to<br/>short retention"]
    style Reject fill:#fdd
    style Warn fill:#fff5e0
    style LongR fill:#eaf2fa
    style ShortR fill:#eaf2fa

Figure 2. An admission decision flow for new metrics at scale. The cardinality check and budget check happen at merge time, not at billing time. Moving this gate upstream is the single change that most reduces remediation cost.
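The decision flow in Figure 2 translates almost directly into a function a CI check could call. This is a sketch, not a reference implementation: the `admit` signature and its inputs are assumptions about what a team would declare when proposing a metric, and the 50,000-series budget is the figure used in the rule file above.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject at CI, open cardinality review ticket"
    WARN = "warn team, review required before merge"
    LONG_RETENTION = "route to long retention"
    SHORT_RETENTION = "route to short retention"


def admit(bounded: bool, projected_series: int, current_series: int,
          budget: int, slo_linked: bool) -> Decision:
    """Walk the three checks in order: bounded, within budget, SLO-linked."""
    if not bounded:
        return Decision.REJECT
    if current_series + projected_series > budget:
        return Decision.WARN
    return Decision.LONG_RETENTION if slo_linked else Decision.SHORT_RETENTION


# Hypothetical proposal: a bounded, SLO-linked metric adding 600 series.
print(admit(True, 600, 42_000, 50_000, slo_linked=True).value)
# route to long retention
```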

SLO-driven monitoring as a forcing function for relevance. Start from user-visible impact. Prune everything that doesn’t connect. The discipline isn’t about reducing telemetry; it’s about giving every metric a reason to exist, and the reason has to trace back to something a user or a customer cares about. Telemetry that doesn’t trace gets retired, or at least deprioritized in retention.

Cardinality budgets per team, with visibility and enforcement. The team that owns a service owns its cardinality, and the platform shows them what they’re spending. When the budget is exceeded, the conversation is between the team and the platform, not between the platform and finance. The mature version has automated rejection at admission, not just polite emails.

Self-service alerting with platform-enforced quality gates. Anyone can add an alert. The platform enforces minimum standards: a runbook link, an owner, a notification target, a tuning history. Alerts without owners get retired automatically. The pattern looks bureaucratic until you’ve spent six months trying to retire alerts manually in a 600-dashboard environment.
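A quality gate of this kind can be a few lines in the pipeline that accepts alert definitions. The sketch below assumes alerts arrive as dictionaries and that each of the four standards maps to one field; the field names are hypothetical, and how a brand-new alert seeds its tuning history (for example with a "created" entry) is left open.

```python
# Hypothetical field names for the four minimum standards.
REQUIRED_FIELDS = ("runbook", "owner", "notification_target", "tuning_history")


def missing_gates(alert: dict) -> list[str]:
    """Return the required fields that are absent or empty; empty list means pass."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field)]


proposed = {
    "name": "CheckoutErrorRateHigh",
    "runbook": "https://runbooks.internal/checkout-errors",  # illustrative URL
    "owner": "",  # no owner yet
    "notification_target": "checkout-oncall",
    "tuning_history": ["created"],
}

print(missing_gates(proposed))  # ['owner']
```

The same check, run on a schedule against existing alerts, gives the list of candidates for automatic retirement.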

Dashboard ownership and lifecycle. Every dashboard has an owner, a purpose statement, and an expiration. Dashboards without owners get archived after a window of no views. The work is unglamorous and pays for itself the first time on-call doesn’t have to triage which of three “production overview” dashboards is the current one.
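The lifecycle rule is simple enough to run as a scheduled job. The following sketch makes several assumptions: the `Dashboard` fields are hypothetical, the 90-day idle window is a placeholder, and sending an expired dashboard to "review" rather than deleting it is one possible policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Dashboard:
    title: str
    owner: Optional[str]
    expires: date
    last_viewed: date


def lifecycle_action(d: Dashboard, today: date,
                     idle_window: timedelta = timedelta(days=90)) -> str:
    """Archive ownerless dashboards nobody views; flag expired ones for review."""
    if d.owner is None and today - d.last_viewed > idle_window:
        return "archive"
    if today > d.expires:
        return "review"
    return "keep"


today = date(2025, 10, 21)
orphan = Dashboard("Production overview (old)", None, date(2026, 1, 1), date(2025, 5, 1))
expired = Dashboard("Checkout SLO", "payments", date(2025, 9, 1), date(2025, 10, 20))
healthy = Dashboard("Cart latency", "cart", date(2026, 3, 1), date(2025, 10, 19))

print([lifecycle_action(d, today) for d in (orphan, expired, healthy)])
# ['archive', 'review', 'keep']
```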

The honest admission on this set: the cardinality budget enforcement is the right answer and also the one that generates the most organizational resistance. Engineering teams read “cardinality budget” as “someone is going to tell us we can’t add labels,” which is exactly right, and the friction is the point. Getting that conversation to happen before the billing shock arrives, not as a consequence of it, is the part that depends on leadership support more than technical tooling.

Patterns that break

Equally instructive is the catalog of patterns that scale poorly, all of which appear in shops that grew without revisiting their monitoring architecture.

The central dashboard team that becomes a bottleneck accumulates a backlog the org eventually gives up on. The team meant to enable the rest of engineering ends up gating it. New service launches stall waiting for dashboards. The team takes on the political cost of telling people no and burns out doing it.

The “we’ll keep everything” retention policy is fine at small scale, but at scale the policy is the dominant cost, and most of what’s retained is never queried. The policy persists because nobody owns the retire-or-justify cycle.

The alerting system that pages on-call for things no one acts on eventually trains the team to ignore the noise. The dangerous alerts get ignored along with the rest, and the first time the alerting system would have mattered is the first time it doesn’t.

These failure patterns connect directly to the foundational monitoring architecture conversation and to the work on observability patterns for distributed infrastructure. Single-system monitoring patterns scale poorly because they were never designed for the cross-team coordination that scale requires.

The organizational dimension

Monitoring at scale is as much an ownership question as a technical one. The platform team operates the monitoring stack as a product, with real users, SLAs, a roadmap, and deprecation paths for old patterns. Teams that consume the platform are real customers with real expectations, and the platform team treats them that way.

The contract between them has to be explicit. Who can emit what, at what cost, with what review. The schema is part of the contract. The retention policy is part of the contract. The cardinality budget is part of the contract. None of this lives in a wiki that nobody reads. It lives in admission policies and dashboards the teams see weekly.

The pattern is recognizable from other large platform conversations: the platform as a product, with internal customers and a real economic relationship between them. Monitoring at scale doesn’t survive without it.

Where this lands

Monitoring at scale is primarily a question of economics and ownership, before it’s a question of tooling. Design the stack to absorb growth predictably and the bill grows in proportion to the value.

The teams that get this right treat their monitoring stack as a system with its own architecture: owned, budgeted, and reviewed on a cadence that matches how fast the rest of the platform is moving. The ones that don’t tend to find out the hard way, usually in the quarter when finance escalates and engineering has to explain a cost trajectory that was set years before anyone named it as a problem. What’s worth asking before that moment arrives is whether your current architecture was designed for the scale you’re at now, or the scale you were at when you first turned it on.
