SLOs and Error Budgets

What 99.9% actually costs you.

Every service makes an implicit promise to users. SRE's job is to make that promise explicit, measure it continuously, and spend failures wisely.

Analogy

An airline promises "99% of flights on time." That leaves roughly four late flights out of every four hundred scheduled: the error budget. The airline can spend that budget on weather delays, an unexpected maintenance hold, or the captain deciding to de-ice one more time. If a snowstorm burns through the entire quarterly budget in one week, the airline stops taking chances: no new route launches, no optional gate swaps, no clever turnarounds, just fly what's scheduled until the budget refills. Burn rate is the control tower noticing "we're losing flights twice as fast as the budget allows, we'll be out by Tuesday." The 99% target only works because everyone (pilots, ops, marketing) agrees what happens when the budget hits zero.

The vocabulary

Term | What it measures | Who owns it
SLI (Service Level Indicator) | A quantifiable metric: request success rate, p99 latency, queue depth. | Engineering
SLO (Service Level Objective) | The target you commit to: "99.9% of requests succeed over a 30-day window." | Engineering + Product
SLA (Service Level Agreement) | The contract with the customer, usually looser than the internal SLO. | Business + Legal

The SLI is the measurement. The SLO is the goal. The SLA is the contract that costs you money if you miss it.

What 99.9% actually buys you

The number sounds impressive until you do the arithmetic.

Availability | Downtime per 30 days | Downtime per year
99%          | ~7.2 hours           | ~3.65 days
99.9%        | ~43 minutes          | ~8.7 hours
99.95%       | ~21.6 minutes        | ~4.4 hours
99.99%       | ~4.3 minutes         | ~52 minutes

43 minutes per month is your entire budget at 99.9%. One botched deploy that takes 50 minutes to roll back overspends that budget by seven minutes all on its own.
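The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, not a standard API: the function name `allowed_downtime` is made up for illustration, and it assumes availability is measured purely as uptime over a fixed window.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a window.

    `slo` is a fraction, e.g. 0.999 for 99.9%. Hypothetical helper.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # The permitted downtime is the complement of the target times the window.
    return window * (1.0 - slo)


for slo in (0.99, 0.999, 0.9995, 0.9999):
    monthly = allowed_downtime(slo, timedelta(days=30))
    yearly = allowed_downtime(slo, timedelta(days=365))
    print(
        f"{slo:.2%}: "
        f"{monthly.total_seconds() / 60:.1f} min per 30 days, "
        f"{yearly.total_seconds() / 3600:.1f} h per year"
    )
```

Each extra nine divides the allowance by ten, which is why the step from 99.9% to 99.99% is so much harder than it looks on paper.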

The error budget

The error budget is the complement of the SLO. At 99.9%, you have a 0.1% budget — that's 43 minutes of failure per 30-day window.

The budget is a resource. You can spend it on:

  • Risky deploys. Ship fast, burn a few minutes, learn faster.
  • Planned maintenance. That database migration you've been delaying.
  • External failures. DNS, third-party APIs, cloud region outages.

When the budget is full, you ship fast. When it's exhausted, you freeze new features and focus on reliability work. This is the core deal between product and SRE.
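The ship-or-freeze rule can be sketched as a simple check. The example below assumes a time-based budget over a 30-day window and treats each incident as a number of fully failed minutes. The function names and the "freeze at zero" threshold are illustrative assumptions; real policies are often more gradual.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total minutes of failure the SLO permits in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def remaining_budget(slo: float, incident_minutes: list[float],
                     window_days: int = 30) -> float:
    """Budget left after subtracting every incident in the window."""
    return error_budget_minutes(slo, window_days) - sum(incident_minutes)


def should_freeze(slo: float, incident_minutes: list[float],
                  window_days: int = 30) -> bool:
    """Assumed policy: freeze feature work once the budget is used up."""
    return remaining_budget(slo, incident_minutes, window_days) <= 0


# Two small incidents leave room to keep shipping at 99.9%.
print(remaining_budget(0.999, [12, 9.5]), should_freeze(0.999, [12, 9.5]))

# A single 50-minute rollback exhausts the budget.
print(remaining_budget(0.999, [50]), should_freeze(0.999, [50]))
```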

Burn rate

Burn rate tells you how quickly you're consuming the budget relative to the window. A burn rate of 1× means you're exactly on pace to exhaust the budget by the end of the window. A burn rate of 2× means you'll be out in half the time.

burn_rate = (1 - measured_sli) / (1 - slo_target)

If your SLO is 99.9% and your current error rate is 0.5%, burn rate is:

(1 - 0.995) / (1 - 0.999) = 0.005 / 0.001 = 5×

At 5× burn rate, a 30-day budget exhausts in 6 days. Your alerting should fire well before that.
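The formula and the worked example translate directly into code. This is a minimal sketch: `burn_rate` mirrors the formula above, while `days_to_exhaustion` is a hypothetical helper that assumes the budget starts full and the error rate stays constant.

```python
def burn_rate(measured_sli: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return (1.0 - measured_sli) / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget never runs out
    return window_days / rate


rate = burn_rate(measured_sli=0.995, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in {days_to_exhaustion(rate):.1f} days")
```

If part of the budget is already spent, the time to exhaustion is shorter than this estimate.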

Choosing the right SLI

Not every metric makes a good SLI. Good SLIs are:

  • Measurable at the boundary. Request success rate at the load balancer, not "the CPU is fine."
  • User-visible. Latency at p99 affects real users. Internal queue depth may not.
  • Comparable. You need a meaningful baseline to set a target against.

Poor SLIs: CPU utilisation, memory usage, disk IOPS. These are resources, not user outcomes. A service can have maxed-out CPU and still serve requests correctly.

Good SLIs: HTTP 2xx rate, p50/p99/p99.9 latency, RPC error rate, data freshness in seconds.
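Two of these SLIs can be computed from request records captured at the boundary, as in the sketch below. The `Request` record, the function names, and the sample data are assumptions for illustration. Counting only 2xx responses as good follows the "HTTP 2xx rate" example, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from typing import NamedTuple


class Request(NamedTuple):
    status: int        # HTTP status code seen at the load balancer
    latency_ms: float  # time to serve the request, in milliseconds


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("no values to measure")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up sample: 99 fast successes and one slow server error.
sample = [Request(200, 10.0 * i) for i in range(1, 100)] + [Request(500, 2000.0)]
latencies = [r.latency_ms for r in sample]

print(f"success rate: {success_rate(sample):.2%}")
print(f"p50 latency: {percentile(latencies, 50)} ms")
print(f"p99 latency: {percentile(latencies, 99)} ms")
```

Both numbers describe what users experienced, which is what separates them from resource metrics like CPU or memory.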

Setting the SLO

Start with measurement, not aspiration. What is your current baseline? If you've been at 99.95% for six months with no deliberate effort, a 99.9% SLO is easy to meet but not useful. A 99.95% SLO creates productive tension.

The SLO should be:

  1. Achievable — based on historical data plus a meaningful improvement target.
  2. Slightly tighter than the SLA — so you catch violations before the customer does.
  3. Small enough in number — track one or two core SLIs per service, not twenty.

The compact between product and SRE

Error budgets only work when both sides respect the compact. Product agrees: if the budget is gone, new features pause. SRE agrees: if the budget is healthy, they support shipping velocity without excessive gate-keeping.

Without that agreement, the error budget is just a dashboard number. With it, it's how you decide what to do next.

Tools in the wild

  • Sloth (library, free tier). Generates Prometheus SLO recording and alerting rules from a simple YAML spec.
  • OpenSLO (spec, free tier). Vendor-neutral spec for declaring SLOs as code; consumed by Sloth, Nobl9, and Datadog.
  • Hosted SLO platform (service). Ingests metrics from Datadog, Prometheus, New Relic, etc.
  • Built-in SLO tracking (service). Comes in monitor-based, metric-based, and time-slice variants.
  • Grafana SLO (service, free tier). SLO builder layered on Grafana Cloud's metrics and alerting stack.