sre · level 7

Percentiles & Distributions

Why averages lie and what the tail tells you.

200 XP

The mean is the worst summary statistic in performance work. It folds the long tail into the comfortable middle, producing a single number that hides the slow requests and that almost no real user actually experiences.

When a single 30-second outlier lands in a million-request day, the mean barely moves. The user who experienced 30 seconds doesn't care about the mean at all; they tabbed away. Percentiles are how you stop hiding from those users.
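
A quick sketch of that arithmetic (the 100 ms baseline latency is an assumption for illustration):

```python
# Hypothetical: one 30-second outlier among a million 100 ms requests.
# The mean shifts by a few hundredths of a millisecond; the user still waited 30 s.
n = 1_000_000
baseline_ms = 100.0
outlier_ms = 30_000.0

mean_without = baseline_ms
mean_with = (baseline_ms * (n - 1) + outlier_ms) / n

print(f"mean without outlier: {mean_without:.4f} ms")
print(f"mean with outlier:    {mean_with:.4f} ms")
print(f"shift:                {mean_with - mean_without:.4f} ms")
```

The shift is roughly 0.03 ms on a 100 ms mean: invisible on any dashboard.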

Analogy

Imagine you run a delivery service and someone asks: "How long does delivery take?"

The mean is "8 hours." Sounds fine. But the 95th percentile is 22 hours and the 99th is 3 days. The mean averages over the customer who got their package in 4 hours and the customer who got it in 4 days. Both real, both valid, but the customer who waited 4 days wrote you a one-star review.

p95 says "1 in 20 packages takes more than 22 hours." p99 says "1 in 100 packages takes more than 3 days." If you ship 1000 packages a day, that's 50 unhappy customers in the p95 bucket and 10 enraged ones in the p99 bucket. The mean told you everything was fine.

Percentiles in plain language

  • p50 (median) — half of requests are faster, half are slower.
  • p90 — 9 in 10 requests are faster than this number.
  • p95 — 19 in 20 requests are faster.
  • p99 — 99 in 100.
  • p99.9 — 999 in 1000. The slow tail you only see at scale.

Read them as "this many users had a worse experience than this number." That framing is what percentiles are for.
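
These definitions can be checked directly against a sorted sample. A minimal nearest-rank sketch (production systems use interpolation or quantile sketches instead):

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # ceil(p/100 * n) as a 1-based rank, clamped so p=0 still returns something
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[int(rank) - 1]

# 100 fake latencies: 1 ms .. 100 ms
latencies = list(range(1, 101))
for p in (50, 90, 95, 99):
    print(f"p{p} = {percentile(latencies, p)} ms")  # 50, 90, 95, 99 ms
```

With a uniform 1..100 sample the percentiles line up exactly with their names, which makes the "19 in 20 requests are faster" reading easy to verify by hand.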

Why p99 matters at scale

A service serving 1,000 requests per second generates 10 p99-or-worse requests every second. Different user every time. The p99 is not an edge case — it's a continuous stream of bad experiences happening in parallel with the median experience.

If your p99 is 3 seconds and your p50 is 200ms, your service has two faces — and the 3-second face is showing up to 600 users every minute. They're filing tickets. The mean (probably 400ms) tells you none of this.
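
The back-of-the-envelope behind those numbers, with the 1,000 requests per second taken from the example above:

```python
# By definition, 1% of requests land at or beyond the p99.
rps = 1_000
p99_tail = 0.01

slow_per_minute = rps * p99_tail * 60
print(f"{slow_per_minute:.0f} p99-or-worse requests per minute")
```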

Distribution shapes

Real-world latency follows a few common patterns. Recognising the shape narrows down the cause:

Normal — symmetric bell curve. Rare in latency. Suggests a single dominant bottleneck with random small variation around it.

Log-normal — long right tail, peak just below the median. Common in healthy services. Most requests are fast; a small but non-trivial fraction are much slower because of cache misses, garbage collection, or network hiccups.

Bimodal — two peaks. Cold-start vs warm-start. Cached vs uncached. With-feature-flag vs without. The shape says "your service is actually two services."

Long-tail — heavy upper tail with rare massive outliers. A small fraction of requests take 50× the median because of GC pauses, lock contention, or downstream timeouts. The mean hides it, p99 might miss it, p99.9 catches it.
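
One way to see these shapes without a plotting tool is to generate synthetic samples and compare percentiles. The parameters below (cache-hit ratio, latencies) are invented for illustration:

```python
import random

random.seed(42)

# Log-normal: most requests fast, heavy right tail (the healthy-service shape).
lognormal = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]

# Bimodal: 80% cache hits around 20 ms, 20% misses around 200 ms.
bimodal = [random.gauss(20, 3) if random.random() < 0.8 else random.gauss(200, 20)
           for _ in range(10_000)]

def pct(xs, p):
    """Simple percentile by sorted index; fine for large n."""
    return sorted(xs)[int(len(xs) * p / 100) - 1]

for name, xs in (("log-normal", lognormal), ("bimodal", bimodal)):
    print(f"{name}: p50 = {pct(xs, 50):.0f} ms, p99 = {pct(xs, 99):.0f} ms")
```

The bimodal signature shows up even without a histogram: the p50 sits in the fast mode while the p99 sits in the slow one, a far bigger gap than a log-normal tail alone produces.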

Reading a CDF vs a histogram

A histogram shows the shape of the distribution: how many requests landed in each latency bucket. Easy to see the modes, the tail.

A CDF (cumulative distribution function) shows the cumulative fraction of requests under each latency. Easy to read percentiles directly: trace up from the latency you care about and read off the percentile.

Both are useful. Histograms are good for "what does this look like?" CDFs are good for "what fraction of users had a bad time?"
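
A sketch of both views over the same samples (the latencies and bucket boundaries are made up; the buckets are cumulative, Prometheus-style):

```python
from bisect import bisect_right

samples = sorted([12, 15, 18, 22, 25, 30, 45, 60, 120, 300])  # ms, made up

# Histogram view: cumulative count of requests at or under each bucket bound.
buckets = [25, 50, 100, float("inf")]
counts = [sum(1 for s in samples if s <= b) for b in buckets]
print("cumulative bucket counts:", dict(zip(buckets, counts)))

# CDF view: fraction of requests at or under a given latency.
def cdf(x):
    return bisect_right(samples, x) / len(samples)

print("fraction under 50 ms:", cdf(50))  # read percentiles straight off this
```

`cdf(50)` answering "0.7" is exactly the CDF reading described above: trace up from 50 ms and read off that 70% of requests were at least that fast.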

Sample-size effects

Percentile estimates are noisy at small sample sizes. With 100 samples, your p99 is one specific sample and could easily be off by 20%. With 1000 samples, the estimate sits near the 10th-worst value, which is much more stable. With 10,000+, you start being able to trust p99.9.

If your dashboard's p99 jumps around minute-to-minute, check the sample count first. The signal might just be sparse data.
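
A quick experiment with synthetic log-normal latencies shows the measured p99 tightening as the sample count grows:

```python
import random

random.seed(0)

def p99(xs):
    return sorted(xs)[int(len(xs) * 0.99) - 1]

def spread(n, trials=100):
    """Range of the measured p99 across repeated draws of n samples each."""
    estimates = [p99([random.lognormvariate(4, 0.5) for _ in range(n)])
                 for _ in range(trials)]
    return min(estimates), max(estimates)

for n in (100, 1_000, 10_000):
    lo, hi = spread(n)
    print(f"n={n:>6}: measured p99 ranged from {lo:.0f} to {hi:.0f} ms")
```

The true p99 never changed between rows; only the sample count did. That is the minute-to-minute jitter a sparse dashboard panel shows.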

In the playground

The Latency Detective gives you a stream of latency samples and a queue of customer complaints. For each ticket, identify the percentile or distribution feature that explains it. The data hides bimodal patterns, tail outliers, and mean shifts. Win condition: 5 of 6 tickets diagnosed correctly.

Tools in the wild

5 tools
  • HdrHistogram (free tier)

    Fixed-bucket histogram with high precision, mergeable across hosts; the reference implementation.

    library
  • Prometheus `histogram_quantile()` over preset buckets — the operator-friendly default.

    library
  • DDSketch (free tier)

    Datadog's quantile sketch with mergeable, relative-error guarantees across hosts.

    library
  • t-digest (free tier)

    Probabilistic quantile sketch popular for streaming p99 estimation.

    library
  • Speedscope (free tier)

    Browser-based viewer for understanding latency distributions and time spent per stack.

    cli