Runbook Execution
Codified steps beat panicked improvisation — walk the tree, don't guess.
A runbook is a written, branching procedure for handling a specific problem — "user can't log in", "VPN won't connect", "printer is spooling but nothing comes out". It encodes what a senior technician would do into a decision tree that a tier-1 technician can follow without panicking. Done right, it turns tribal knowledge into a repeatable playbook; done wrong, it's condescending, stale, and ignored.
Why runbooks exist
Three forces push organisations to write them:
- Tribal knowledge dies. The senior tech who always knew "oh, it's the stale Kerberos ticket" leaves the company, and with them go the shortcuts nobody wrote down. A runbook freezes that knowledge in a form that survives turnover.
- Responders paged at 3am aren't thinking clearly. At 3am the on-call engineer is pattern-matching on fumes. A written procedure you can follow mechanically is worth more than cleverness you don't have.
- Audit trails need them. Compliance frameworks (SOC 2, ISO 27001) want evidence that similar incidents get handled consistently. A runbook plus a ticket saying "followed steps 1-5 of the lockout runbook" is that evidence.
Properties of a good runbook
A good runbook has the shape of a decision tree, not a wall of prose. Each step has:
- A clear test. "Check if the user is in the engineering group": yes or no, no interpretation needed.
- A single next step per answer. If the answer to a test is "sometimes A, sometimes B," the test isn't specific enough yet; split it until each answer has one next action.
- A terminal condition. Every branch eventually ends in either "resolved" (problem fixed, ticket closed) or "escalate" (hand off to tier-2 with these notes). A runbook without a terminal is a procedure you can walk forever, which is what happens when a junior tech gets stuck in a loop on a ticket for six hours.
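The three properties above map naturally onto a small data structure. The following is a minimal sketch, not a reference implementation: the `Node` class, the `walk` helper, and the example lockout runbook are all hypothetical names and content invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    prompt: str                                               # the clear test, or the action at a leaf
    branches: Dict[str, str] = field(default_factory=dict)    # exactly one next node id per answer
    terminal: Optional[str] = None                            # "resolved" or "escalate" at a leaf


# Hypothetical example runbook; node ids and prompts are illustrative only.
RUNBOOK = {
    "start": Node("Is the account locked out?", {"yes": "unlock", "no": "check_group"}),
    "unlock": Node("Unlock the account and confirm the user can log in.", terminal="resolved"),
    "check_group": Node("Is the user in the engineering group?", {"yes": "tier2", "no": "add_group"}),
    "add_group": Node("Add the user to the group and confirm login.", terminal="resolved"),
    "tier2": Node("Hand off to tier-2 with notes.", terminal="escalate"),
}


def walk(runbook: Dict[str, Node], answers: List[str], start: str = "start") -> Tuple[List[str], Optional[str]]:
    """Follow the answers from the start node; return the path taken and the terminal reached (if any)."""
    path = [start]
    node = runbook[start]
    for answer in answers:
        if node.terminal is not None:
            break                              # already at a leaf, ignore extra answers
        next_id = node.branches[answer]        # KeyError means the answer is not a valid branch
        path.append(next_id)
        node = runbook[next_id]
    return path, node.terminal
```

Calling `walk(RUNBOOK, ["no", "yes"])` follows start, check_group, tier2 and ends in "escalate". Because `branches` is a plain mapping, an answer can never point at two next steps, which is the "single next step per answer" rule enforced by construction.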
The playground on the next tab lets you walk two example runbooks — "user can't log in" and "VPN won't connect" — and shows you the decision graph as you go.
The "is it just you / what changed / layers check" meta-framework
Even without a written runbook, three questions get you 80% of the way on almost any incident:
- Is it just you? Is this one user, one laptop, or everyone? Scope collapses the search space — a single-user problem is almost always an account or device issue; a fleet-wide problem is infrastructure.
- What changed? Deploys, patches, cert renewals, DNS tweaks, password rotations — most outages are correlated with a recent change. Checking the change log first is faster than reasoning from first principles.
- Layers check. Walk the stack: physical → network → auth → app. If the user can't log in, can they ping anything? Do they have DNS? Does SSO work at all? Does this one app fail where others succeed? The answer narrows the layer; the runbook for that layer does the rest.
Write runbooks on top of this framework, not instead of it — a junior tech who's memorised the three questions but has no runbook will still out-troubleshoot a runbook-robot who doesn't understand why.
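As a rough illustration, the three questions can be expressed as one small triage function. This is a sketch under stated assumptions: the function name, its parameters, and the shape of the result are hypothetical, and the scope hint is only the rule of thumb given above, not a diagnosis.

```python
from typing import Dict, List, Optional

# Layer order taken from the text: physical -> network -> auth -> app.
LAYERS = ["physical", "network", "auth", "app"]


def triage(affected_users: int, recent_changes: List[str], layer_ok: Dict[str, bool]) -> dict:
    """Answer the three questions and return hints for picking a runbook."""
    # 1. Is it just you? One affected user points at account/device, many point at infrastructure.
    likely_area = "account or device" if affected_users <= 1 else "infrastructure"
    # 2. What changed? Surface the change log entries to review before anything else.
    review_first = list(recent_changes)
    # 3. Layers check: the first layer that fails is where to start looking.
    #    Layers missing from layer_ok are treated as untested and skipped.
    first_failing: Optional[str] = next(
        (layer for layer in LAYERS if layer_ok.get(layer) is False), None
    )
    return {
        "likely_area": likely_area,
        "review_first": review_first,
        "first_failing_layer": first_failing,
    }
```

For a single user whose ping and DNS work but whose SSO fails, `triage(1, ["cert renewal"], {"physical": True, "network": True, "auth": False})` reports "account or device" as the likely area and "auth" as the first failing layer.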
When to deviate from the runbook
A runbook encodes assumptions. When those assumptions are wrong, "follow the runbook anyway" is the wrong move. Typical cases:
- The runbook assumes cloud, but you're on-prem — e.g. "reset the user's session in Okta" when the org still uses on-prem AD.
- The runbook assumes a specific tool version — e.g. "use the new admin console" on a site still running the old one because upgrade is blocked.
- The runbook was written for one region / tenant / environment and silently doesn't apply to yours.
- The symptom doesn't match the runbook's first test — if the first question is "is the user locked out?" but the user can log in fine and the app is broken for everyone, you're on the wrong runbook.
The discipline: note the mismatch in the ticket, pick a different runbook or escalate, and file a bug against the runbook itself so the next person doesn't fall into the same pit.
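One way to make mismatches visible early is to have each runbook declare its assumptions and compare them against the environment before step 1. The sketch below assumes a flat key/value format; the keys (`identity_provider`, `region`) and the function name are hypothetical.

```python
from typing import Dict, List


def check_assumptions(assumptions: Dict[str, str], environment: Dict[str, str]) -> List[str]:
    """Return one human-readable line per assumption the environment does not satisfy."""
    mismatches = []
    for key, expected in assumptions.items():
        actual = environment.get(key)          # None when the environment does not define the key
        if actual != expected:
            mismatches.append(
                f"{key}: runbook assumes {expected!r}, environment has {actual!r}"
            )
    return mismatches


# Hypothetical example: a cloud-IdP runbook applied to an on-prem site.
runbook_assumptions = {"identity_provider": "okta", "region": "eu"}
site = {"identity_provider": "on-prem-ad", "region": "eu"}

for line in check_assumptions(runbook_assumptions, site):
    # In practice this text would go into the ticket and into a bug against the runbook.
    print("MISMATCH", line)
```

An empty list means the runbook's declared assumptions hold; anything else is the note to paste into the ticket before switching runbooks or escalating.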
Writing runbooks that aren't condescending
The worst runbooks read like "Step 1: Open your web browser. Step 2: Type the URL." They insult the reader and get ignored on the second incident.
- Assume competence at the level of your audience. A tier-1 runbook can assume you know what SSO is. A first-day-orientation runbook can't.
- Link, don't reprint. Instead of pasting a 20-line command, link to the canonical doc. Reprints go stale the moment the doc changes.
- Name tools explicitly. "Reset the user in the admin console" is ambiguous; "Reset the user in Okta → People → user → Actions → Reset Password" is executable.
- Show the expected happy-path output. "You should see login.microsoft.com/common/oauth2/authorize" lets the reader spot-check; "not seeing that" is itself a branching signal.
Real tools
- PagerDuty attaches runbook URLs to alert definitions — the alert links you straight to the procedure. This is where the "runbook must be at the place the alert fires" discipline comes from.
- Notion / Confluence host the long-form runbook wikis at most companies. Both support decision-tree-style pages with internal anchors.
- GitHub "ops-playbook" repos — a growing convention is to keep runbooks in a git repo next to the code they're about, so PRs that change behaviour also update the runbook. The Zalando and GitLab ops-playbook repos are public examples.
- Backstage (Spotify's dev portal) surfaces per-service runbooks next to the service itself in the catalog.
Common gotchas
- Stale runbooks that reference a UI that no longer exists. Mitigation: date every runbook, require a review every 90 days, and link it to the service it documents so that changes to the service flag the runbook for review.
- Runbooks that depend on commands only the author can run — the classic is a step that needs SSH access to a bastion host that only the author has keys to. If the runbook only works for you, it isn't a runbook, it's a personal note.
- Runbooks without a terminal condition — procedures that loop forever because every branch says "if still broken, try X again". Every branch must end.
- Runbooks that skip the "verify" step. A fix without verification is a guess. Every resolving leaf should include "confirm the original symptom is gone, then close the ticket with a note saying what you did."
- Runbooks that duplicate each other — "account lockout", "password reset", and "MFA loop" are probably branches of one runbook, not three. Duplication rots faster than any single source of truth.
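Several of these gotchas can be caught mechanically if runbooks are stored as data. The following is a minimal lint sketch, assuming a hypothetical format in which each node is a dict with optional `next`, `terminal`, and `verify` keys; it checks staleness against the 90-day review window, dead ends, missing verify steps, and loops.

```python
from datetime import date, timedelta
from typing import Dict, List


def lint_runbook(nodes: Dict[str, dict], start: str, last_reviewed: date,
                 today: date, max_age_days: int = 90) -> List[str]:
    """Return a list of problems; an empty list means none of the checks fired."""
    problems = []

    # Stale runbook: last review is older than the review window.
    if today - last_reviewed > timedelta(days=max_age_days):
        problems.append(f"stale: last reviewed more than {max_age_days} days ago")

    for node_id, node in nodes.items():
        # A non-terminal node with no outgoing branch is a dead end.
        if node.get("terminal") is None and not node.get("next"):
            problems.append(f"{node_id}: dead end with no terminal condition")
        # Every resolving leaf should carry a verify step.
        if node.get("terminal") == "resolved" and not node.get("verify"):
            problems.append(f"{node_id}: resolving leaf has no verify step")

    # Loop detection: depth-first search, flagging any edge back into the current path.
    state: Dict[str, str] = {}                 # node id -> "visiting" or "done"

    def visit(node_id: str) -> None:
        if state.get(node_id) == "done":
            return
        if state.get(node_id) == "visiting":
            problems.append(f"{node_id}: loop detected, this branch never ends")
            return
        state[node_id] = "visiting"
        for next_id in nodes[node_id].get("next", {}).values():
            if next_id not in nodes:
                problems.append(f"{node_id}: points at unknown node {next_id!r}")
                continue
            visit(next_id)
        state[node_id] = "done"

    visit(start)
    return problems
```

Run against a runbook whose "retry" node branches back to itself, the lint reports a loop at "retry". A check like this could run in CI for runbooks kept in a git repo, though that wiring is outside this sketch.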
Playground
Pick one of the two runbooks, read the prompt on the current node, and click the option that matches the situation you're diagnosing. The live cost counter tracks the time you've spent, the path list at the bottom shows the steps you've taken, and the current node is highlighted in the DAG on the Visualizer tab. Use Reset to restart from the start node, or switch the runbook to try a different scenario.
Visualizer
The Runbook Graph draws the whole DAG: rectangles for nodes, arrows for choices, your walked path in the accent colour, resolved leaves in green, escalate leaves in amber. It's the same shape the eventual ticket tooling should let you see — "which branch did the technician take?" is a useful post-incident question.
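A similar picture can be produced outside the playground by emitting Graphviz DOT text. This is a sketch only: the node format (dicts with `next` and `terminal` keys) is hypothetical, and the colour names ("green", "orange" for amber, "blue" for the accent) are assumptions, not the Visualizer's actual palette.

```python
from typing import Dict, Sequence


def to_dot(nodes: Dict[str, dict], walked: Sequence[str] = ()) -> str:
    """Render a runbook as DOT: boxes for nodes, labelled arrows for choices."""
    walked = list(walked)
    walked_edges = set(zip(walked, walked[1:]))          # consecutive pairs on the walked path
    leaf_colours = {"resolved": "green", "escalate": "orange"}

    lines = ["digraph runbook {", "  node [shape=box];"]
    for node_id, node in nodes.items():
        colour = leaf_colours.get(node.get("terminal"))
        if colour:
            lines.append(f'  "{node_id}" [color="{colour}"];')
        for answer, next_id in node.get("next", {}).items():
            # Edges on the walked path get the accent colour.
            accent = ', color="blue"' if (node_id, next_id) in walked_edges else ""
            lines.append(f'  "{node_id}" -> "{next_id}" [label="{answer}"{accent}];')
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and rendered with the Graphviz `dot` tool. Passing the path recorded on a ticket as `walked` answers "which branch did the technician take?" visually.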