sre · level 2

Observability

Metrics, logs, traces — and when each one wins.

200 XP

Observability

A system you cannot observe is a system you cannot operate. Observability is not a single tool — it is the property of a system that lets you answer questions about its internal state from its external outputs.

The three standard signals are metrics, logs, and traces. Each answers different questions. Each has different costs.

Analogy

Observing a service is like running a fleet of delivery vans. Metrics are the dashboard gauges on each van: speed, fuel level, engine temperature — a constant stream of numbers telling you whether things are broadly okay. Logs are the driver's daily journal: "9:12am, flat tire on Elm Street, swapped spare, resumed at 9:41" — discrete events written down when they happen, searchable when you need to know what went wrong with one specific van. Traces are the GPS breadcrumb trail for a single package: it left the warehouse at 7:05, sat at the sorting hub until 9:20, rode truck #42 to the neighborhood depot at 10:15, and arrived at the doorstep at 11:04 — which hop ate the time. The gauges tell you the fleet is slow; the breadcrumbs tell you why.

Metrics

Metrics are numerical measurements aggregated over time. They tell you the shape of behaviour: request rate, error percentage, latency distribution, queue depth.

When metrics win:

  • Dashboards and alerting. A p99 latency crossing a threshold fires a page.
  • Long-term trends. Capacity planning over weeks.
  • Cheap aggregation. One number per series per interval, cheap to store and query.

The cardinality trap: Metrics explode when you label them with high-cardinality dimensions. http_requests_total{user_id="12345"} is a separate metric series per user. With ten million users, you've created ten million series. Storage and query cost follow. Keep metric labels to stable, low-cardinality dimensions: method, status_code, endpoint_group — not user IDs or request IDs.
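A minimal sketch of the low-cardinality version, using the Prometheus Go client (prometheus/client_golang); the metric name follows the example above and the endpoint names are illustrative:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Low-cardinality labels: a handful of methods, status codes, and endpoint
// groups produces at most a few hundred series.
var httpRequests = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "HTTP requests by method, status code, and endpoint group.",
    },
    []string{"method", "status_code", "endpoint_group"},
    // not user_id or request_id: that would mint one series per user or request
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // Record against a stable endpoint group, never the raw URL or the user.
    httpRequests.WithLabelValues(r.Method, "200", "checkout").Inc()
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/checkout", handleCheckout)
    http.Handle("/metrics", promhttp.Handler()) // scrape endpoint for Prometheus
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Per-user or per-request detail belongs in logs or traces, not in a metric label.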

Logs

Logs are structured or unstructured records of discrete events. They tell you what happened, when, and to which request.

When logs win:

  • Debugging a specific request. "Why did user 12345's checkout fail?"
  • Audit trails. Who called this API at this time?
  • Events that are rare enough that full records are affordable.

Structured logging means emitting JSON (or key-value pairs) instead of printf strings:

{"level":"error","ts":1704067200,"request_id":"req-abc","user_id":"u-123","msg":"payment declined","code":"insufficient_funds"}

Structured logs are queryable. WHERE code = 'insufficient_funds' works. grep "insufficient_funds" on a string log works too, but you can't filter by user_id without another pattern match.
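One way to emit that record, sketched with Go's standard log/slog package (Go 1.21+); the field names and values mirror the example above:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // A JSON handler turns every log call into one machine-parseable line.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    logger.Error("payment declined",
        "request_id", "req-abc",
        "user_id", "u-123",
        "code", "insufficient_funds",
    )
    // Emits: {"time":"...","level":"ERROR","msg":"payment declined",
    //         "request_id":"req-abc","user_id":"u-123","code":"insufficient_funds"}
}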

Cost warning: Full request logging at high throughput is expensive. Log at DEBUG in dev; aggregate or sample in production.

Traces

Traces follow a request as it moves through services. A trace is a tree of spans — each span records one unit of work: a database query, an RPC call, a cache lookup.

When traces win:

  • Latency attribution. "The checkout took 2.3s. Which service ate it?"
  • Cascading failures. A downstream service is slow, and every upstream service shows the same pattern.
  • Understanding distributed call graphs you've never seen before.

A trace shows you causality that metrics cannot. "p99 is high" is a metric. "The p99 comes from a specific SQL query that runs in the inventory service when the cart exceeds ten items" is a trace.
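What producing those spans looks like is sketched below with the OpenTelemetry Go SDK; the service, span, and attribute names are illustrative, and a tracer provider is assumed to be configured elsewhere:

package checkout

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

var tracer = otel.Tracer("inventory-service")

// ReserveItems wraps the whole operation in a parent span and the SQL lookup
// in a child span, so a trace viewer can attribute the latency to the query.
func ReserveItems(ctx context.Context, cartSize int) error {
    ctx, span := tracer.Start(ctx, "ReserveItems")
    defer span.End()
    span.SetAttributes(attribute.Int("cart.size", cartSize))

    queryCtx, querySpan := tracer.Start(ctx, "inventory.select_stock")
    runStockQuery(queryCtx) // stand-in for the real database call
    querySpan.End()

    return nil
}

func runStockQuery(ctx context.Context) { /* elided */ }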

Cost: Full trace capture is expensive at high volume. Head-based sampling (decide at the root) or tail-based sampling (decide after you know the outcome) controls cost while preserving the interesting cases.
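Head-based sampling is a one-line decision in the SDK. A sketch with the OpenTelemetry Go SDK, using a 10% ratio chosen purely for illustration (tail-based sampling typically happens later, in a collector, once the outcome of the trace is known):

package telemetry

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newTracerProvider keeps roughly 10% of traces. The decision is made at the
// root span and propagated, so children follow their parent's choice.
func newTracerProvider() *sdktrace.TracerProvider {
    return sdktrace.NewTracerProvider(
        sdktrace.WithSampler(
            sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)),
        ),
    )
}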

RED vs USE

Two methods frame what to measure depending on the kind of component:

RED (for services handling requests):

  • Rate — requests per second
  • Error rate — fraction of requests that fail
  • Duration — latency distribution

USE (for infrastructure resources):

  • Utilisation — how busy is it, as a fraction of capacity
  • Saturation — how much work is queued or delayed
  • Errors — hard failures

A Kubernetes pod that processes HTTP requests wants RED metrics. The node's CPU, memory, and network interfaces want USE metrics.
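A sketch of RED instrumentation with the Prometheus Go client: one histogram carries all three signals, since its sample count gives the rate, the status_code label separates errors, and the buckets hold the duration distribution. The wrapper and label names here are illustrative:

package api

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// http_request_duration_seconds: rate, errors, and duration in one instrument.
var requestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency by endpoint group and status code.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"endpoint_group", "status_code"},
)

// Instrument wraps a handler and records one observation per request.
func Instrument(group string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        requestDuration.WithLabelValues(group, strconv.Itoa(rec.status)).
            Observe(time.Since(start).Seconds())
    }
}

// statusRecorder captures the status code the handler writes.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (s *statusRecorder) WriteHeader(code int) {
    s.status = code
    s.ResponseWriter.WriteHeader(code)
}

USE metrics for the node itself usually come from an exporter such as node_exporter rather than from application code.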

Picking the right signal

Question → Signal
Is the service healthy right now? → Metrics
Why did this specific request fail? → Logs
Which service added 400ms to this checkout? → Traces
Has latency trended upward since Tuesday's deploy? → Metrics
What SQL query ran during that slow trace? → Traces + Logs
Who called the delete endpoint at 3 AM? → Logs

The signals are complementary, not competing. Production observability stacks combine all three with a correlation layer — a request ID threaded through logs and traces links them to the same event.
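A sketch of that correlation layer in Go, assuming OpenTelemetry tracing middleware has already placed a span in the request context; the logger field names are illustrative:

package middleware

import (
    "log/slog"
    "net/http"

    "go.opentelemetry.io/otel/trace"
)

// WithTraceID stamps every log line for a request with the active trace and
// span IDs, so a log hit during an incident links straight to its trace.
func WithTraceID(logger *slog.Logger, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        spanCtx := trace.SpanContextFromContext(r.Context())
        reqLogger := logger.With(
            "trace_id", spanCtx.TraceID().String(),
            "span_id", spanCtx.SpanID().String(),
        )
        reqLogger.Info("request received", "path", r.URL.Path)
        next.ServeHTTP(w, r)
    })
}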

Tools in the wild

  • OpenTelemetry (library, free tier): CNCF standard SDK + collector for emitting traces, metrics, and logs from any language.
  • Prometheus (library, free tier): Pull-based time-series database; the de facto open-source metrics backend.
  • Grafana (service, free tier): Dashboards over Prometheus, Loki, Tempo, and most cloud telemetry sources.
  • Loki (library, free tier): Log aggregator that indexes labels (not content); the Prometheus model applied to logs.
  • Tempo (library, free tier): Open-source distributed tracing backend; cheap object-storage-backed traces.
  • Honeycomb (service): High-cardinality observability built on traces; great for complex production debugging.
  • Datadog (service): All-in-one metrics + logs + APM + RUM platform; widely used at mid/large companies.