SLOs and Error Budgets
What 99.9% actually costs you.
Every service makes an implicit promise to users. SRE's job is to make that promise explicit, measure it continuously, and spend failures wisely.
Analogy
An airline promises "99% of flights on time." That leaves roughly four late flights out of every four hundred scheduled — the error budget. The airline can spend that budget on weather delays, an unexpected maintenance hold, or the captain deciding to de-ice one more time. If a snowstorm burns through the entire quarterly budget in one week, the airline stops taking chances: no new route launches, no optional gate swaps, no clever turnarounds — just fly what's scheduled until the budget refills. Burn rate is the control tower noticing "we're losing flights twice as fast as usual, we'll be out by Tuesday." The 99% target only works because everyone — pilots, ops, marketing — agrees what happens when the budget hits zero.
The vocabulary
| Term | What it measures | Who owns it |
|---|---|---|
| SLI (Service Level Indicator) | A quantifiable measure of service behavior: request success rate, p99 latency, data freshness. | Engineering |
| SLO (Service Level Objective) | The target you commit to: "99.9% of requests succeed over a 30-day window." | Engineering + Product |
| SLA (Service Level Agreement) | The contract with the customer, usually looser than the internal SLO. | Business + Legal |
The SLI is the measurement. The SLO is the goal. The SLA is what you pay for if you miss it.
What 99.9% actually buys you
The number sounds impressive until you do the arithmetic.
| Availability | Downtime per 30 days | Downtime per year |
|---|---|---|
| 99% | ~7.2 hours | ~3.65 days |
| 99.9% | ~43 minutes | ~8.7 hours |
| 99.95% | ~21.6 minutes | ~4.4 hours |
| 99.99% | ~4.3 minutes | ~52 minutes |
43 minutes per 30 days is your entire budget at 99.9%. One botched deploy that takes 50 minutes to roll back spends all of it and overshoots by about seven minutes, so the SLO is already missed for that window.
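The table rows can be reproduced with a few lines of Python. This is a minimal sketch; the function name `allowed_downtime` is illustrative and not taken from any library.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The permitted downtime is the complement of the target, scaled to the window.
    return window * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    per_30_days = allowed_downtime(target, timedelta(days=30))
    per_year = allowed_downtime(target, timedelta(days=365))
    print(
        f"{target}%: {per_30_days.total_seconds() / 60:.1f} min per 30 days, "
        f"{per_year.total_seconds() / 3600:.1f} h per year"
    )
```

Each extra nine divides the allowance by ten, which is why the step from 99.9% to 99.99% is so much more expensive than it looks.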
The error budget
The error budget is the complement of the SLO. At 99.9%, you have a 0.1% budget — that's 43 minutes of failure per 30-day window.
The budget is a resource. You can spend it on:
- Risky deploys. Ship fast, burn a few minutes, learn faster.
- Planned maintenance. That database migration you've been delaying.
- External failures. DNS, third-party APIs, cloud region outages.
When the budget is full, you ship fast. When it's exhausted, you freeze new features and focus on reliability work. This is the core deal between product and SRE.
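The ship-or-freeze rule can be sketched as a small check on the remaining budget. This is one possible reading of the policy, assuming a time-based budget; the function names and the two-state `"ship"`/`"freeze"` outcome are hypothetical simplifications, and a real policy would likely have more gradations.

```python
def budget_remaining_minutes(
    slo_target: float, window_days: float, downtime_minutes: float
) -> float:
    """Minutes of error budget left in the window (negative means overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    # Total budget is the complement of the SLO applied to the window length.
    total_budget = (1 - slo_target) * window_days * 24 * 60
    return total_budget - downtime_minutes


def release_decision(remaining_minutes: float) -> str:
    """Assumed policy: ship while any budget remains, otherwise freeze features."""
    return "ship" if remaining_minutes > 0 else "freeze"


# Ten minutes of downtime against a 99.9% SLO leaves room to ship.
print(release_decision(budget_remaining_minutes(0.999, 30, 10)))
# A 50-minute rollback overspends the roughly 43-minute budget.
print(release_decision(budget_remaining_minutes(0.999, 30, 50)))
```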
Burn rate
Burn rate tells you how quickly you're consuming the budget relative to the window. A burn rate of 1× means you're exactly on pace to exhaust the budget by the end of the window. A burn rate of 2× means you'll be out in half the time.
burn_rate = (1 - measured_sli) / (1 - slo_target)
If your SLO is 99.9% and your current error rate is 0.5%, burn rate is:
(1 - 0.995) / (1 - 0.999) = 0.005 / 0.001 = 5×
At 5× burn rate, a 30-day budget exhausts in 6 days. Your alerting should fire well before that.
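The same arithmetic as a minimal Python sketch. The function names are illustrative, and `days_to_exhaustion` assumes the budget starts full and the burn rate stays constant, which real traffic rarely does.

```python
def burn_rate(measured_sli: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0 to leave any budget")
    return (1 - measured_sli) / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never consumed
    return window_days / rate


rate = burn_rate(measured_sli=0.995, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```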
Choosing the right SLI
Not every metric makes a good SLI. Good SLIs are:
- Measurable at the boundary. Request success rate at the load balancer, not "the CPU is fine."
- User-visible. Latency at p99 affects real users. Internal queue depth may not.
- Comparable. You need a meaningful baseline to set a target against.
Poor SLIs: CPU utilisation, memory usage, disk IOPS. These are resources, not user outcomes. A service can have maxed-out CPU and still serve requests correctly.
Good SLIs: HTTP 2xx rate, p50/p99/p99.9 latency, RPC error rate, data freshness in seconds.
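As an illustration, two of these SLIs can be computed from request records captured at the service boundary. This is a sketch under assumptions: the `Request` record and its field names are hypothetical, success is taken to mean an HTTP 2xx status, and the percentile uses the simple nearest-rank method rather than whatever interpolation a given metrics backend applies.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code seen at the boundary
    latency_ms: float  # end-to-end latency in milliseconds


def success_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    if not requests:
        raise ValueError("no requests to measure")
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return ok / len(requests)


def latency_percentile(requests: Sequence[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, e.g. pct=99 for p99."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 99 successful requests with latencies 1..99 ms, plus one slow failure.
sample = [Request(200, float(i)) for i in range(1, 100)] + [Request(500, 100.0)]
print(success_rate(sample), latency_percentile(sample, 99))
```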
Setting the SLO
Start with measurement, not aspiration. What is your current baseline? If you've been at 99.95% for six months with no deliberate effort, a 99.9% SLO is easy to meet but not useful. A 99.95% SLO creates productive tension.
The SLO should be:
- Achievable. Base it on historical data plus a meaningful improvement target.
- Slightly tighter than the SLA. That way you catch violations before the customer does.
- Limited in number. Track one or two core SLIs per service, not twenty.
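These checks can be expressed as a small review helper. This is a hypothetical sketch, not an established method: `review_slo` and its parameters are invented for illustration, and the rules simply restate the guidance above (a target looser than the measured baseline creates no tension, the SLO must be tighter than the SLA, and more than two core SLIs is too many).

```python
from typing import List, Optional


def review_slo(
    candidate: float,
    baseline: float,
    sla: Optional[float] = None,
    sli_count: int = 1,
) -> List[str]:
    """Return warnings for a candidate SLO; an empty list means no objections."""
    warnings = []
    if candidate < baseline:
        # Already met without effort, so the target drives no reliability work.
        warnings.append("looser than the measured baseline: easy to meet, not useful")
    if sla is not None and candidate <= sla:
        warnings.append("not tighter than the SLA: customers see misses first")
    if sli_count > 2:
        warnings.append("too many SLIs: track one or two core SLIs per service")
    return warnings


# A service that has been running at 99.95% with a 99.5% SLA.
print(review_slo(candidate=0.999, baseline=0.9995, sla=0.995))
print(review_slo(candidate=0.9995, baseline=0.9995, sla=0.995))
```

A target above the baseline is deliberately not flagged, since an achievable SLO may include an improvement goal.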
The compact between product and SRE
Error budgets only work when both sides respect the compact. Product agrees: if the budget is gone, new features pause. SRE agrees: if the budget is healthy, they support shipping velocity without excessive gate-keeping.
Without that agreement, the error budget is just a dashboard number. With it, it's how you decide what to do next.
Tools in the wild
- Library: Sloth (free tier). Generates Prometheus SLO recording and alerting rules from a simple YAML spec.
- Spec: OpenSLO (free tier). Vendor-neutral spec for declaring SLOs as code; consumed by Sloth, Nobl9, and Datadog.
- Service: Hosted SLO platform that ingests metrics from Datadog, Prometheus, New Relic, etc.
- Service: Built-in SLO tracking with monitor-based, metric-based, and time-slice variants.
- Service: Grafana SLO (free tier). SLO builder layered on Grafana Cloud's metrics and alerting stack.