The Cost of Telemetry
Telemetry architecture is mostly an economics conversation. The interesting question isn’t “metrics or traces.” It’s what the cardinality budget is, what the retention tier looks like, and who pays for the query. Teams that read it as a tooling conversation end up with the surprise invoice and the hasty pruning project that follows.
The pattern is familiar. Six months earlier, a team adds one label to a hot path so it can break out errors by tenant. It looks harmless in the merge request, but it multiplies the cardinality of every metric on that path by the tenant count, which is four digits and growing. Nobody on the platform team is in the review. Nobody on the application team has a way to see what the label costs. The bill arrives quietly, then suddenly.
Three signals, three economic shapes
The “three pillars” framing has been picked apart for years and is still the working vocabulary in most rooms I sit in. Useful enough. The signals have genuinely different shapes, and the shapes drive different decisions:
| Signal | Per-event cost | Cost dominated by | Retention pain |
|---|---|---|---|
| Metrics | Small | Label cardinality | Low if you control labels |
| Logs | Large | Volume and indexing | High at scale |
| Traces | Medium | Span fan-out and sampling | High at full fidelity |
| Events (deploys, config, alerts) | Small | Neglect until incident review | Low |
A retention policy that works for metrics will bankrupt you on logs. A sampling policy that works on traces will lie to you on metrics. The mistake teams make is choosing a vendor and inheriting one set of defaults across every signal. Events tend to get neglected until an incident review needs them and they’re not there.
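To make the sampling point concrete, here is a minimal sketch of what happens when a count is derived from a sampled stream. The request stream, the error pattern, and the 1-in-10 rate are all assumptions chosen for illustration, not figures from a real system.

```python
# Hypothetical stream: 1,000 requests, every 7th one fails.
requests = [{"id": i, "error": i % 7 == 0} for i in range(1000)]

SAMPLE_EVERY = 10  # assumed sampling rate: keep 1 request in 10


def sample(events, every):
    """Keep every Nth event (deterministic, so the example is reproducible)."""
    return [e for i, e in enumerate(events) if i % every == 0]


def count_errors(events):
    """Count events flagged as errors."""
    return sum(1 for e in events if e["error"])


true_errors = count_errors(requests)
sampled_errors = count_errors(sample(requests, SAMPLE_EVERY))
scaled_estimate = sampled_errors * SAMPLE_EVERY

print(true_errors, sampled_errors, scaled_estimate)
```

The sampled stream is still fine for finding a representative failing request, which is what traces are for. Read as a counter, it undercounts by roughly the sampling rate, and even the scaled-up figure is an estimate rather than the true count.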
Where the data is born
Telemetry decisions get made early in the call stack, often by people who don’t know they’re making them. A junior engineer adds a counter with five labels. A library author bakes in a tracing convention. A platform team picks a collector and ships it before the conversation about what it should collect.
OpenTelemetry has been the convergence the industry mostly settled on, and it has earned that standing. The collector model decouples instrumentation from backend choice, which preserves optionality the old proprietary agents quietly destroyed. It still wobbles in places: the metrics SDK has been less stable than the traces SDK, and the semantic conventions for some domains are still in churn. None of that is a reason not to adopt it. It’s a reason to know which parts of the stack are mature and which ones aren’t.
The boundary that most stacks blur without naming is between application telemetry and infrastructure telemetry. Application telemetry asks “is my service healthy from the user’s perspective?” Infrastructure telemetry asks “is the substrate the service runs on healthy?” Different owners, different cadences, different signals. Teams that merge them under one collector and one budget end up with the application team unable to debug their service because the platform team’s host metrics ate the cardinality budget. I’ve covered the broader version of this split in the observability patterns piece.
Transport, storage, and the cardinality bomb
Push versus pull is the first decision teams argue about and the least important one. Both work. The interesting decisions live one layer down.
Aggregation, sampling, and retention are three independent levers, and most teams pull them as if they were one. Aggregation reduces dimensional fidelity at write time. Sampling reduces volume by keeping a subset. Retention reduces volume by aging things out. You can mix all three. Aggregating at the collector before the data hits the backend is often the cheapest move, because a series dropped there is never stored, indexed, or queried, and the saving multiplies across every other label on the metric.
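The three levers can be sketched as three separate functions that each change a different property of the data. This is an illustrative model only: the data points, label names, and thresholds below are made up, and a real collector or backend implements each step differently.

```python
from collections import defaultdict

# Hypothetical data points: one per (series, day).
points = [
    {"labels": {"service": "checkout", "tenant_id": "t1"}, "day": 1, "value": 5},
    {"labels": {"service": "checkout", "tenant_id": "t2"}, "day": 1, "value": 3},
    {"labels": {"service": "search", "tenant_id": "t1"}, "day": 1, "value": 7},
    {"labels": {"service": "search", "tenant_id": "t3"}, "day": 9, "value": 2},
]


def aggregate(points, keep):
    """Aggregation: drop labels outside `keep`, summing points that collapse together."""
    totals = defaultdict(int)
    for p in points:
        kept = tuple(sorted((k, v) for k, v in p["labels"].items() if k in keep))
        totals[(kept, p["day"])] += p["value"]
    return [{"labels": dict(k), "day": d, "value": v} for (k, d), v in totals.items()]


def sample(points, every):
    """Sampling: keep every Nth point, leaving labels and age untouched."""
    return points[::every]


def retain(points, today, max_age_days):
    """Retention: drop points older than max_age_days, leaving the rest untouched."""
    return [p for p in points if today - p["day"] <= max_age_days]


aggregated = aggregate(points, keep={"service"})
thinned = sample(points, every=2)
recent = retain(points, today=10, max_age_days=7)
print(len(points), len(aggregated), len(thinned), len(recent))
```

Each function can be tuned without touching the other two, which is the property that gets lost when all three are inherited as one vendor default.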
The cardinality bomb is the named version of the per-tenant label pattern. A single high-cardinality label on a hot path multiplies the series count of every metric that carries it by the number of distinct values the label takes: every existing combination of the other labels is repeated once per tenant. At a company that was trying to get ahead of their next funding round without blowing up their infrastructure budget, the bill had grown sharply year over year with no change in capacity, headcount, or product surface. The fix was collector-level dimensional aggregation: drop the per-tenant label at the collector for the dashboards that didn’t need it, and keep it for the small set of queries that did. The bill stabilized in two billing cycles. The harder fix was a cardinality budget per service, with a CI check that flagged label additions before they shipped.
The honest admission here: the cardinality budget per service is the right answer, and it’s also the one that creates friction for teams who’ve never thought about label cost. Getting engineering leadership to treat that friction as worthwhile, before a billing shock forces the conversation, is the part of this problem that has no clean solution.
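A minimal sketch of what a per-service cardinality check in CI could look like, assuming each service declares its metrics with an estimated number of distinct values per label. The declaration format, the `None` marker for unbounded labels, and the budget figure are all hypothetical. The product of label cardinalities is an upper bound, since not every combination occurs in practice.

```python
from math import prod


def series_upper_bound(label_cardinalities):
    """Worst-case series count for one metric: the product of its label cardinalities."""
    return prod(label_cardinalities.values())


def check_service(service, metrics, budgets):
    """Return human-readable violations; an empty list means the change passes."""
    problems = []
    total = 0
    for metric in metrics:
        # A cardinality of None marks a label with no known bound (hypothetical convention).
        unbounded = sorted(k for k, n in metric["labels"].items() if n is None)
        if unbounded:
            problems.append(f"{metric['name']}: unbounded label(s): {', '.join(unbounded)}")
            continue
        total += series_upper_bound(metric["labels"])
    if total > budgets[service]:
        problems.append(f"{service}: {total} series exceeds budget of {budgets[service]}")
    return problems


budgets = {"checkout": 500}  # hypothetical per-service series budget

before = [{"name": "http_requests_total", "labels": {"status_code": 5, "method": 4}}]
after = [{"name": "http_requests_total",
          "labels": {"status_code": 5, "method": 4, "tenant_id": 2000}}]

print(check_service("checkout", before, budgets))
print(check_service("checkout", after, budgets))
```

The arithmetic is the whole argument: 5 status codes times 4 methods is 20 series, and adding an assumed 2,000 tenants turns that into 40,000. Surfacing that number in the merge request is what gives the application team a way to see what the label costs.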
[Diagram: a source that emits events with labels feeds the Collector (aggregation, sampling, redaction). The Collector routes to a hot tier (full fidelity, short retention), a warm tier (downsampled, medium retention), and a cold tier (archived, query on demand). The hot and warm tiers serve queries, alerts, and dashboards; the cold tier is queried only rarely.]
Figure 1. Aggregation, sampling, and retention are three independent levers. Most teams collapse them into one decision at the vendor level and lose the ability to tune each one independently.
Choosing a retention tier per signal
The table above describes the economic shape of each signal. The table below shows how to translate that shape into retention decisions. Teams that apply a single default across all signal types are paying for retention they don’t query and losing retention they do.
| Signal | Hot tier | Warm tier | Cold tier | Notes |
|---|---|---|---|---|
| Metrics (SLO-linked) | 30 days | 13 months | Indefinite (downsampled) | Trend analysis requires long retention |
| Metrics (non-SLO) | 7 days | 30 days | Drop | Most teams over-retain this tier |
| Logs (error/audit) | 30 days | 90 days | Compliance-driven | Audit logs may have regulatory minimums |
| Logs (debug) | 3 days | Drop | Drop | Expensive to retain; rarely queried past 72h |
| Traces (sampled) | 7 days | 30 days | Drop | Full-fidelity traces age out quickly |
| Events (deploy, config) | 90 days | 2 years | Indefinite | Cheap; invaluable for incident review |
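The retention table can be read as a lookup from signal and age to tier. The sketch below encodes it that way under a few assumptions that the table does not state: tier limits are treated as cumulative ages from creation, 13 months is approximated as 395 days and 2 years as 730 days, and "Compliance-driven" is modeled as simply keeping the data. The signal keys are hypothetical names.

```python
# (hot limit in days, warm limit in days or None if there is no warm tier, cold action)
POLICY = {
    "metrics_slo":      (30, 395, "keep"),   # cold: indefinite, downsampled
    "metrics_non_slo":  (7, 30, "drop"),
    "logs_error_audit": (30, 90, "keep"),    # cold: compliance-driven, modeled as keep
    "logs_debug":       (3, None, "drop"),
    "traces_sampled":   (7, 30, "drop"),
    "events":           (90, 730, "keep"),
}


def tier_for(signal, age_days):
    """Return the tier a record of this signal type belongs in at the given age."""
    hot_limit, warm_limit, cold_action = POLICY[signal]
    if age_days <= hot_limit:
        return "hot"
    if warm_limit is not None and age_days <= warm_limit:
        return "warm"
    return "cold" if cold_action == "keep" else "dropped"


for signal, age in [("logs_debug", 4), ("metrics_slo", 100), ("traces_sampled", 31)]:
    print(signal, age, tier_for(signal, age))
```

Keeping the policy as data rather than as per-backend settings makes it reviewable in one place, which helps when the question is which signals are being retained longer than anyone queries them.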
[Diagram: label admission flow. One outcome blocks the label (cardinality review required). A further check approves the label and tracks it against the service budget on Yes; on No it asks whether a dashboard exists that queries the label. If one does, the label is approved; if not, the decision is deferred: add the dashboard first, then the label.]
Figure 2. A label admission decision flow. Blocking unbounded labels at the merge request is the cheapest point to enforce cardinality policy. Teams that push this check downstream find the cost of reverting labels after services depend on them.
What a collector config actually looks like
The following is a representative OpenTelemetry Collector configuration that implements the tier model: aggregation at the collector, with routing rules that send SLO-linked metrics to long retention and debug metrics to a short-lived backend.
```yaml
# otelcol-config.yaml
# Implements tiered retention routing with cardinality reduction.
# SLO-linked metrics route to the long-retention backend.
# Debug metrics route to the short-retention backend and expire in 7 days.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

processors:
  # Aggregate high-cardinality labels before they hit storage
  metricstransform/aggregate_tenant:
    transforms:
      - include: http_requests_total
        match_type: strict
        action: update
        operations:
          - action: aggregate_labels
            label_set: [service, status_code, method]  # drops tenant_id
            aggregation_type: sum
  # Tag SLO-linked metrics for routing
  # NOTE: `pattern` is documented for the `extract` action; its use with
  # `insert` here is illustrative. Verify how your collector version selects
  # which metrics get tagged before relying on it.
  attributes/slo_tag:
    actions:
      - key: retention_tier
        value: long
        action: insert
        pattern: "^(checkout|search|auth).*"
  batch:
    timeout: 10s
    send_batch_size: 1000

exporters:
  # Retention via a request header is backend-specific; treat it as a placeholder.
  prometheusremotewrite/long:
    endpoint: "https://metrics-long.internal/api/v1/write"
    headers:
      X-Retention-Days: "395"
  prometheusremotewrite/short:
    endpoint: "https://metrics-short.internal/api/v1/write"
    headers:
      X-Retention-Days: "7"

service:
  pipelines:
    # Both pipelines read from the same receiver, so each sees every metric
    # unless a filter or routing step is added per pipeline.
    metrics/slo:
      receivers: [otlp]
      processors: [metricstransform/aggregate_tenant, attributes/slo_tag, batch]
      exporters: [prometheusremotewrite/long]
    metrics/debug:
      receivers: [otlp]
      processors: [metricstransform/aggregate_tenant, batch]
      exporters: [prometheusremotewrite/short]
```
Who actually queries the data
Anyone can build a dashboard. Few systems support open-ended investigation, which is the real test of whether your telemetry stack is doing its job. The dashboards your team built six months ago answer the questions you knew to ask six months ago. The questions you’re going to need to ask during the next incident haven’t been asked yet.
That’s why the cost model your vendor uses matters more than any technical decision in the stack. Per-host pricing makes you stingy with hosts and generous with cardinality, which is backwards. Per-GB pricing makes you stingy with retention and generous with sampling. Per-query pricing makes engineers afraid to investigate, which is the worst of the three. The model your vendor chose has shaped what your team feels free to ask.
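The incentive argument can be sketched with arithmetic. The unit prices and usage figures below are hypothetical, chosen only to make the shape visible, and the sketch assumes that adding label cardinality shows up as more ingested data.

```python
# Hypothetical unit prices; real vendor pricing is more complicated.
PRICE_PER_HOST = 20.0
PRICE_PER_GB = 0.5
PRICE_PER_1K_QUERIES = 10.0


def monthly_bill(model, hosts, ingested_gb, queries):
    """Bill under a single pricing model; only one input matters per model."""
    if model == "per_host":
        return hosts * PRICE_PER_HOST
    if model == "per_gb":
        return ingested_gb * PRICE_PER_GB
    if model == "per_query":
        return queries / 1000 * PRICE_PER_1K_QUERIES
    raise ValueError(f"unknown pricing model: {model}")


baseline = dict(hosts=100, ingested_gb=4000, queries=50000)
more_labels = dict(baseline, ingested_gb=8000)    # assumed effect of added cardinality
more_digging = dict(baseline, queries=100000)     # engineers investigate twice as much

for model in ("per_host", "per_gb", "per_query"):
    print(model,
          monthly_bill(model, **baseline),
          monthly_bill(model, **more_labels),
          monthly_bill(model, **more_digging))
```

Under the per-host model, doubling the data volume costs nothing, so nobody feels the label. Under the per-query model, the only thing that moves the bill is engineers asking questions.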
Dashboards have an ownership problem. They get built for a project, the project ends, the dashboard remains. Six months later half the dashboards on the wall reference services nobody has touched. Name an owner per dashboard, review the list quarterly, delete generously. Nobody likes deleting dashboards, which is exactly why the list keeps growing. I’ve written about this as part of the broader monitoring architecture conversation.
Pricing the signal against the purpose
Telemetry architecture is one of the few places where the economics directly shape what your team can know about your system. Treat the cost curve as someone else’s problem and you’ll get the surprise invoice, the pruning project that lasts longer than expected, and engineers losing access to telemetry they were depending on, without anyone meaning that to happen.
Price each signal against what it’s for. Make both the cost and the purpose visible to the engineers generating the data. Tie the SLOs you’ve committed to to the specific signals that compute them, and protect those signals when the pruning starts. The conversation about which dashboard to delete gets much easier when everyone in the room can see what the dashboard costs and what it’s worth.
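One way to protect SLO-linked signals when pruning starts is to make the link explicit in data. The sketch below assumes a hypothetical inventory that maps each SLO to the metrics that compute it and each metric to a monthly cost; all names and figures are invented for illustration.

```python
# Hypothetical mapping from SLO to the metrics that compute it.
slo_inputs = {
    "checkout-availability": {"http_requests_total"},
    "search-latency": {"search_duration_seconds"},
}

# Hypothetical monthly cost per metric, in whatever unit the bill uses.
metric_costs = {
    "http_requests_total": 1200.0,
    "search_duration_seconds": 800.0,
    "cache_debug_hits": 950.0,
    "jvm_gc_pause": 150.0,
}


def pruning_candidates(costs, slo_inputs):
    """Metrics that feed no SLO, most expensive first."""
    protected = set().union(*slo_inputs.values())
    candidates = [(name, cost) for name, cost in costs.items() if name not in protected]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


for name, cost in pruning_candidates(metric_costs, slo_inputs):
    print(name, cost)
```

The most expensive metric in this made-up inventory never appears on the list, because an SLO depends on it. The list that remains shows cost next to name, which is the visibility the pruning conversation needs.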
The engineers who figure this out usually do it after the first billing shock. The question worth sitting with is whether there’s a way to run the conversation before the shock forces it.