On-Call
Alert hygiene, runbooks, and sustainable rotations.
Being on-call is part of owning a service. Done well, it's manageable. Done poorly, it burns people out in months.
Analogy
On-call is a volunteer fire brigade rotation in a small town. When the siren goes off at 2am, whoever is on shift that week drags on their boots and drives to the station — not because they'll fight every fire alone, but because somebody has to show up first and decide whether this needs one truck or a full mutual-aid call. The runbook is the laminated card by the door: "grease fire → class K extinguisher, never water." If the smoke alarm on the firehouse roof keeps tripping because of passing trains, and the brigade rolls out for it every time, within a month nobody's sprinting to the truck anymore — and the night there's a real fire, the delay is measured in buildings. Healthy rotations have enough volunteers that nobody carries the pager more than one week in four, and every false alarm gets investigated the next morning, not ignored.
What on-call actually means
On-call means you are the first responder for your service during your rotation. When an alert fires, you investigate, mitigate, and escalate if needed. You are not expected to fix everything alone — you are expected to respond quickly and coordinate effectively.
A healthy on-call rotation:
- Pages are actionable. Every alert that fires is something a human needs to act on.
- The schedule is shared. No single person carries the rotation indefinitely.
- Runbooks exist. You should not need to figure out remediation steps under pressure at 3 AM.
Alert hygiene
The biggest threat to on-call sustainability is alert fatigue. When too many alerts fire, engineers start ignoring them — including the real ones.
An alert is actionable when every page it produces requires a human decision. If the correct response to an alert is "wait and see" or "it always fires at this time of day," the alert is noise. Delete it or convert it to a low-priority ticket.
Principles:
- Alert on symptoms, not causes. Page on "checkout error rate > 1%", not "database CPU > 80%". The database might be busy and still serving requests fine.
- Alert on SLO consumption. If burn rate exceeds 2× for 1 hour, page. If burn rate exceeds 14× for 5 minutes, page immediately (fast burn). This maps directly to error budget impact; see the sketch after this list.
- Reduce alert frequency before increasing severity. A noisy Sev 3 that fires 20 times a week is worse than a well-tuned Sev 1 that fires twice a year.
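A minimal sketch of that burn-rate rule, using the thresholds above. The 99.9% SLO target and the `error_ratio` inputs are assumptions standing in for whatever your monitoring system reports:

```python
# Burn rate: how many times faster than "sustainable" the error budget
# is being consumed. At 1x, the budget lasts exactly the SLO window.
SLO_TARGET = 0.999          # assumed 99.9% availability SLO
BUDGET = 1 - SLO_TARGET     # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    return error_ratio / BUDGET

def page_decision(err_5m: float, err_1h: float) -> str | None:
    """Apply the two windows from the list above."""
    if burn_rate(err_5m) > 14:   # fast burn: budget gone within hours
        return "page immediately"
    if burn_rate(err_1h) > 2:    # slow burn: sustained budget drain
        return "page"
    return None                  # below both thresholds: no page

# 1.6% errors over 5 minutes on a 99.9% SLO is a 16x burn:
print(page_decision(err_5m=0.016, err_1h=0.004))  # -> "page immediately"
```

The two windows are complementary: the short one catches sudden outages fast, while the long one catches slow leaks a short window would smooth over.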
Runbooks
A runbook is a document that tells you what to do when a specific alert fires. It should answer:
- What does this alert mean?
- What is the user impact?
- What are the first things to check?
- What are the mitigation steps?
- Who do you escalate to, and when?
Runbooks must be maintained. An outdated runbook that points to a service that no longer exists is worse than no runbook — it wastes time you don't have.
Link runbooks directly from alerts. When a page fires, the engineer should reach the relevant runbook in two clicks.
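One way to keep that guarantee honest is to enforce it in CI. A sketch, assuming Prometheus-style YAML rule files with a `runbook_url` annotation; the `alerts/` path and key names are assumptions to adapt to your setup:

```python
# Fails CI when an alerting rule ships without a runbook link.
import glob
import sys

import yaml  # pip install pyyaml

def missing_runbooks(paths: list[str]) -> list[str]:
    missing = []
    for path in paths:
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        for group in doc.get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rules need no runbook
                if not rule.get("annotations", {}).get("runbook_url"):
                    missing.append(f"{path}: {rule['alert']}")
    return missing

if __name__ == "__main__":
    offenders = missing_runbooks(glob.glob("alerts/**/*.y*ml", recursive=True))
    for entry in offenders:
        print(f"missing runbook_url: {entry}")
    sys.exit(1 if offenders else 0)
```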
Handoffs
The on-call shift ends, but the context lives on. A good handoff prevents incidents from falling through the cracks.
A handoff note covers:
- Any open or recently resolved incidents
- Known fragile services or elevated risk (a deploy scheduled for tomorrow, a config change that's been acting up)
- Any workarounds currently in place
- Outstanding action items
Handoff is a ritual, not an email. A 10-minute verbal or video handoff is worth more than a written paragraph when the situation is complex.
Pager fatigue
Pager fatigue is cumulative. An engineer who gets paged twice a night for a week arrives at the end of the rotation exhausted, prone to mistakes, and possibly looking for a new job.
Measuring fatigue (a sketch for computing these from a page log follows the list):
- Pages per on-call shift (target: < 2/day outside business hours)
- Time-to-acknowledge (a rising P90 means people are sleeping through pages)
- Alert-to-incident ratio (if 80% of pages turn out to need no action, your alerts are too noisy)
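All three signals are computable from your paging tool's export. A sketch, with assumed field names (`fired_at`, `acked_at`, `action_needed`, `out_of_hours`) that you would map to whatever your tool actually emits:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import quantiles

@dataclass
class Page:
    fired_at: datetime    # when the page fired
    acked_at: datetime    # when a human acknowledged it
    action_needed: bool   # did it require any action?
    out_of_hours: bool    # fired outside business hours?

def fatigue_report(pages: list[Page], shift_days: int) -> dict:
    tta = sorted((p.acked_at - p.fired_at).total_seconds() for p in pages)
    # quantiles(n=10) returns 9 cut points; the last one is the P90.
    p90 = quantiles(tta, n=10)[-1] if len(tta) > 1 else (tta[0] if tta else 0.0)
    return {
        "ooh_pages_per_day": sum(p.out_of_hours for p in pages) / shift_days,
        "p90_tta_seconds": p90,
        "alert_to_incident": (
            sum(p.action_needed for p in pages) / len(pages) if pages else 1.0
        ),
    }
```

Trend these per shift rather than per quarter; a single brutal week is exactly the signal you want to surface.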
Addressing fatigue requires investment from leadership, not just the on-call engineer. The engineer can fix one alert. Fixing the culture of shipping without runbooks, without SLOs, without alerting review is an org-level commitment.
Healthy rotations
| Practice | Why it matters |
|---|---|
| Minimum 4 people in rotation | No one is on call more than 1 week in 4 |
| Follow-the-sun (distributed teams) | Business-hours pages only for each engineer |
| Compensation for on-call shifts | Acknowledged as real work |
| Post-incident alert review | Every incident triggers a review: should the alert have been different? |
| Shadow rotations for new engineers | Learn the system before carrying the pager alone |
The feedback loop
Every alert that fires is data. Track it. Review alert frequency monthly. The healthiest on-call rotations treat noisy alerts as bugs, not background noise.
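A sketch of what that monthly review can look like, assuming you can export per-alert fire counts and how many fires led to action (the counts below are illustrative):

```python
from collections import Counter

# Illustrative monthly export: fires per alert, and how many needed action.
fires = Counter({"CheckoutErrorRate": 3, "DbCpuHigh": 41, "DiskAlmostFull": 12})
actioned = Counter({"CheckoutErrorRate": 3, "DbCpuHigh": 2, "DiskAlmostFull": 10})

for alert, count in fires.most_common():
    ratio = actioned[alert] / count
    if count >= 10 and ratio < 0.5:   # frequent and mostly no-op: a bug
        print(f"file a bug: {alert} fired {count}x, {ratio:.0%} actionable")
```

Run against these numbers, only `DbCpuHigh` is flagged: it fired 41 times with a 5% action rate, which is a cause-based alert begging to be rewritten as a symptom.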
Tools in the wild
- PagerDuty: industry-standard paging with rotations, escalations, overrides, and on-call analytics.
- Opsgenie: Atlassian's PagerDuty competitor; tight Jira and Statuspage integration.
- Grafana OnCall (free tier): open-source on-call scheduling that integrates with Grafana alerting; hosted tier available.
- An affordable PagerDuty alternative with built-in incident response and SLOs.
- Combined uptime monitoring, on-call rotations, and status pages in a single product.