SLI Design
Picking measures that catch what hurts customers.
Knowing you need an SLO is easy. Picking the SLI that actually catches what hurts customers — and ignores what doesn't — is the hard part.
A bad SLI fires alerts that nobody acts on, then misses the outage that matters. A good SLI maps tightly to user pain, is computable from data you already have, and aggregates cleanly across services.
Analogy
Think of an SLI as a fire alarm. A smoke detector in the kitchen is a different SLI than a smoke detector in the bedroom — one is "request slice = cooking smoke," the other is "request slice = sleeping safety." A good alarm in the bedroom triggers on real fire and ignores cooking smoke. A bad alarm triggers on every burnt toast and gets ignored when the real fire starts.
The design questions are the same for SLIs and smoke alarms: what events are you measuring? What's the boundary between OK and not-OK? How long do you average over before deciding to ring the bell?
What an SLI is
An SLI is a quantitative measure of service health, defined by three things:
- Slice — which requests you're measuring. POST /api/checkout, all 5xx responses, the user-facing tier, etc.
- Aggregation — how you turn many requests into one number. Two main archetypes (below).
- Threshold — the boundary between "good" and "bad" for a single observation.
Concrete example: "the fraction of POST /api/checkout requests in a 1-minute window where server latency was less than 400 milliseconds and HTTP status was less than 500."
That's a complete SLI. You can compute it from request logs. You can compare it to a target (the SLO).
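A minimal sketch of that computation in Python, assuming request logs with a path, an HTTP status, and a server-side latency in milliseconds (the field names and the `checkout_sli` helper are illustrative, not tied to any particular logging library):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    path: str          # e.g. "POST /api/checkout"
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds

def checkout_sli(logs: list[RequestLog]) -> float | None:
    """Fraction of POST /api/checkout requests that were 'good':
    latency < 400 ms AND status < 500. Returns None if no request
    matched the slice, to avoid 0/0 (see the anti-patterns below)."""
    sliced = [r for r in logs if r.path == "POST /api/checkout"]
    if not sliced:
        return None
    good = sum(1 for r in sliced if r.latency_ms < 400 and r.status < 500)
    return good / len(sliced)

# One 1-minute window of logs -> one SLI observation.
window = [
    RequestLog("POST /api/checkout", 200, 180.0),
    RequestLog("POST /api/checkout", 503, 90.0),
    RequestLog("GET /api/cart", 200, 40.0),   # outside the slice, ignored
]
print(checkout_sli(window))  # 0.5
```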
Three properties of a good SLI
Customer-visible. The metric correlates with what users perceive. Server-side p95 latency for a backend service is a fine SLI; CPU utilisation is not (no user has ever filed a ticket about CPU).
Additive across components. You can roll up service SLIs into a journey SLI without weird math. "Did the checkout work?" composes from cart-service-OK AND payment-service-OK AND email-service-OK.
Clear good-vs-bad. No fuzzy "kinda slow." Either latency is below the threshold or it isn't. Status code is < 500 or it isn't. If the boundary is fuzzy, the alert will be too.
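A toy illustration of the last two properties — unambiguous per-request verdicts composing into a journey verdict with plain boolean AND (the service names and helper are illustrative):

```python
def checkout_journey_ok(cart_ok: bool, payment_ok: bool, email_ok: bool) -> bool:
    """Journey SLI composed from component SLIs with boolean AND.
    Each input is already a clear good/bad verdict, so the roll-up
    needs no weighting or fuzzy math."""
    return cart_ok and payment_ok and email_ok

# Per-request verdicts roll up into a journey success ratio.
requests = [
    (True, True, True),    # everything worked
    (True, False, True),   # payment failed -> journey failed
]
good = sum(checkout_journey_ok(*r) for r in requests)
print(good / len(requests))  # 0.5
```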
Two SLI archetypes
Request-based — successful_requests / total_requests. Works when traffic is steady. Easy to interpret. Divides by zero on quiet services.
Window-based — % of windows where the service was good. A 1-minute window is "good" if every observation within it met the threshold. Works when traffic can drop to zero overnight. Coarser than request-based — a service that's bad for one minute per hour scores 1/60 ≈ 1.67% bad regardless of how much traffic hit during that minute.
Pick request-based for high-traffic, always-on services. Pick window-based for sparse-traffic services where divide-by-zero would be a problem and "did anyone notice in this window?" is the real question.
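A sketch of the two archetypes side by side, assuming per-minute events tagged good or bad; treating an empty window as "good" is one common choice here, not the only one:

```python
def request_based_sli(events: list[tuple[int, bool]]) -> float | None:
    """events: (minute, was_good). Good requests / total requests.
    Returns None when there is no traffic -- the divide-by-zero case."""
    if not events:
        return None
    return sum(1 for _, good in events if good) / len(events)

def window_based_sli(events: list[tuple[int, bool]], total_minutes: int) -> float:
    """Fraction of 1-minute windows that were good. A window with no
    traffic counts as good; a window with any bad event counts as bad."""
    bad_windows = {minute for minute, good in events if not good}
    return (total_minutes - len(bad_windows)) / total_minutes

# One bad minute out of 60, no matter how many requests it carried:
events = [(m, True) for m in range(60)] + [(17, False)] * 500
print(request_based_sli(events))     # ~0.11 -- dominated by the bad-minute burst
print(window_based_sli(events, 60))  # ~0.983 -- 59/60 good windows
```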
Anti-patterns
The four most common SLI design mistakes:
"Uptime" without defining a request. "The service was up 99.9% of the time" — up doing what? Returning 200s? Within latency? On the right hostname? Without a specific request slice, this is unmeasurable.
Ratios over zero. Request-based SLI on a service with overnight quiet periods. The 3 AM window has 0 requests, you compute 0/0, you get NaN, your dashboard breaks.
Mixing latency and availability. "99% of requests succeeded AND were under 500ms." Now you don't know which condition fired. Use two SLIs (see the sketch after this list).
Server-side measurement that ignores client reality. Backend reports 200ms p95. CDN adds 300ms. ISP adds 100ms. User experiences 600ms. The server-side SLI says everything's fine. The user closes the tab.
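A minimal sketch of the fix for the third anti-pattern — two separate SLIs over the same window of (status, latency) pairs, so an alert names exactly one condition (the helpers and sample data are illustrative):

```python
def availability_sli(requests: list[tuple[int, float]]) -> float | None:
    """requests: (status, latency_ms). Good = non-5xx."""
    if not requests:
        return None
    return sum(1 for status, _ in requests if status < 500) / len(requests)

def latency_sli(requests: list[tuple[int, float]]) -> float | None:
    """Good = server-side latency under 400 ms."""
    if not requests:
        return None
    return sum(1 for _, latency_ms in requests if latency_ms < 400) / len(requests)

window = [(200, 180.0), (503, 90.0), (200, 950.0)]
print(availability_sli(window))  # 2/3 -- an availability alert names this condition
print(latency_sli(window))       # 2/3 -- a latency alert names this one
```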
Setting the SLO target
The target is a customer-pain threshold, not an engineering-comfort threshold.
99.5% over 30 days means 3.6 hours of monthly broken time is fine. Confirm that's actually fine for the business before committing. If revenue drops every minute the checkout is down, 3.6 hours is not fine — pick 99.9% (43 minutes/month) or 99.95% (22 minutes/month).
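The arithmetic behind those numbers, as a quick check for any candidate target over a 30-day window:

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of total breakage per window."""
    return (1 - target) * window_days * 24 * 60

for target in (0.995, 0.999, 0.9995):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.0f} min/month")
# 99.50% -> 216 min/month  (3.6 hours)
# 99.90% -> 43 min/month
# 99.95% -> 22 min/month
```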
The SLO should be set by the business owner of the service, with engineering input on what's achievable. Engineering picks targets unilaterally when they want to look good; the business often agrees because nobody told them what the target meant in customer-pain terms.
Multi-window multi-burn-rate alerts
Single-window alerts miss two patterns:
- Fast outage — a 30-second total outage doesn't move a 30-day rolling SLI noticeably. The alert never fires.
- Slow drift — a deploy adds 100ms to p95, slowly eroding budget. A single fast alert never trips because no one minute is bad enough.
Multi-window multi-burn-rate alerts catch both:
- Fast burn: 5-minute window. Fire when burn rate ≥ 14× (i.e., budget would be exhausted in ~2 days at current rate).
- Slow burn: 1-hour window. Fire when burn rate ≥ 6× (budget exhausted in ~5 days).
Each condition catches a different failure mode; either one wakes somebody up.
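A sketch of both conditions, assuming a request-based SLI and a 99.9% target over 30 days; the 14× and 6× thresholds mirror the ones above, and the function names are illustrative:

```python
def burn_rate(bad: int, total: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being spent, relative to the pace
    that would exactly exhaust it over the SLO period.
    1x = on pace to spend the whole budget; 14x = gone in ~2 of 30 days."""
    if total == 0:
        return 0.0
    error_budget = 1 - slo_target
    return (bad / total) / error_budget

def should_page(bad_5m, total_5m, bad_1h, total_1h, slo_target=0.999) -> bool:
    """Multi-window multi-burn-rate: fast burn (5-minute window, >= 14x)
    OR slow burn (1-hour window, >= 6x)."""
    fast = burn_rate(bad_5m, total_5m, slo_target) >= 14
    slow = burn_rate(bad_1h, total_1h, slo_target) >= 6
    return fast or slow

# A total outage in the last 5 minutes: error rate 100%, burn rate 1000x.
print(should_page(bad_5m=500, total_5m=500, bad_1h=500, total_1h=6000))  # True
```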
In the playground
The SLI Lab gives you a service description and lets you compose an SLI from dropdowns: slice, aggregation, threshold, window. Then it simulates 30 days of traffic with seeded incidents and tells you which incidents your SLI would have caught — and which false positives it threw. Iterate until the verdict is "caught all real incidents, zero false positives."
Tools in the wild
- Prometheus (library, free tier) — Metric backend; `histogram_quantile()` powers most homegrown SLI calculations.
- Sloth (library, free tier) — Generates the Prometheus recording + alert rules for fast/slow burn from a YAML SLO spec.
- OpenSLO (spec, free tier) — YAML schema for declaring SLI queries + SLO targets independent of any backend.
- (service) — Auto-derives latency + error SLIs from instrumented traces; native SLO product on top.
- (service) — SLOs computed over high-cardinality traces; the SLI is any boolean derived field.