Incident Response
Declare, stabilize, communicate, rollback.
An incident is any event that disrupts or degrades service. How you respond in the first fifteen minutes determines whether a 15-minute outage becomes a 15-hour one.
Analogy
Incident response is an emergency room at the moment a patient rolls in bleeding. The trauma team doesn't stop to debate the patient's childhood diet — they clamp the artery first (stabilize), then order tests (investigate). One person runs the room as lead physician (Incident Commander) and doesn't do procedures themselves; another holds pressure (Ops), another updates the family in the waiting room (Comms), and a scribe writes timestamps on the chart as orders are given. The worst thing anyone can do is silently vanish to "go check something" — in the ER and in an outage, silence is how people die. The post-op review the next morning asks what the system let happen, not which nurse to blame.
The first 15 minutes
Speed matters, but panicked speed makes things worse. The sequence:
1. Declare. Name the incident early. Create a channel, assign an Incident Commander. An undeclared incident has no coordination.
2. Stabilize. Stop the bleeding before you diagnose. Rollback or mitigation buys time to investigate.
3. Communicate. Status page, stakeholder update, internal channel. Everyone who needs to know should know before they ask.
4. Investigate. Now look at the why. Metrics, logs, traces. Form a hypothesis, test it, repeat.
Skipping step two to rush to step four is the most common mistake. Customers don't care why their checkout is broken — they care when it will work. Stabilize first.
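Declaration should be one command, not a procedure someone has to remember under stress. A minimal sketch using Slack's `slack_sdk` Python client; the channel naming convention, bot-token scopes, and CLI shape are assumptions, not a prescribed setup.

```python
# declare.py -- one-command incident declaration (sketch).
# Assumes a Slack bot token with channels:manage and chat:write scopes.
import os
import sys
from datetime import datetime, timezone

from slack_sdk import WebClient


def declare(summary: str, severity: int, commander: str) -> str:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # Channel name like "inc-2024-06-01-1330" (convention assumed here).
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
    channel = client.conversations_create(name=f"inc-{stamp}")["channel"]["id"]

    # Anchor the three things a declaration provides: channel, IC, severity.
    client.conversations_setTopic(
        channel=channel,
        topic=f"Sev {severity} | IC: {commander} | {summary}",
    )
    client.chat_postMessage(
        channel=channel,
        text=(
            f":rotating_light: Sev {severity} declared: {summary}\n"
            f"Incident Commander: {commander}. Coordinate here, not in DMs."
        ),
    )
    return channel


if __name__ == "__main__":
    declare(sys.argv[1], int(sys.argv[2]), sys.argv[3])
```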
Roles
Clear roles prevent diffusion of responsibility. The four core roles:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates the response. Assigns tasks, manages pace, makes the final call on escalation and rollback. Does not touch production directly. |
| Operations (Ops) | Executes changes. Makes the rollback happen, runs the query, scales the fleet. Follows IC direction. |
| Communications (Comms) | Writes the status page updates, drafts stakeholder messages, keeps the comms channel legible. |
| Scribe | Documents decisions, actions, timestamps in real time. The incident timeline is written during, not after. |
On a small team, one person sometimes plays two roles. IC + Comms is common. IC + Ops is risky — you cannot coordinate when you're also making production changes.
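That constraint is worth enforcing in tooling rather than trusting memory at 3 a.m. A small sketch; the role map and function are hypothetical, mirroring the table above.

```python
def validate_roles(roles: dict[str, str], severity: int) -> None:
    """Reject role assignments that collapse coordination into execution.

    `roles` maps role -> person, e.g. {"IC": "dana", "Ops": "lee"}.
    IC + Comms is an accepted doubling; IC + Ops is not on a Sev 1/2.
    """
    ic, ops = roles.get("IC"), roles.get("Ops")
    if severity <= 2 and ic is not None and ic == ops:
        raise ValueError(
            "IC and Ops must be different people on a Sev 1/2: "
            "you cannot coordinate while also changing production."
        )
```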
Declare early
The cost of declaring an incident when it turns out to be nothing is embarrassment. The cost of not declaring when it is something is uncoordinated chaos. Declare early.
A declaration anchors:
- The incident channel (where coordination happens)
- The IC (who is in charge)
- The severity level (how much urgency / who gets paged)
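Those three anchors are the whole declaration record. A sketch of what tooling might store; the field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """The declaration record: everything a responder needs to orient."""
    summary: str
    severity: int    # 1 (worst) to 3, per the table below
    channel: str     # where coordination happens
    commander: str   # who is in charge
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```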
Severities
| Sev | Meaning | Example |
|---|---|---|
| Sev 1 | Service is down or severely degraded for most users | Checkout 0% success rate |
| Sev 2 | Significant partial degradation | 30% of payments failing |
| Sev 3 | Minor degradation, workaround exists | One region slow, others fine |
Sev levels drive escalation and communication cadence, not parallelism. Don't have 20 people on a Sev 3.
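One way to make that concrete: severity is a lookup key for paging and cadence, nothing more. The page targets and intervals below are illustrative defaults, not a standard.

```python
from datetime import timedelta

# Severity drives who gets paged and how often you update -- nothing else.
SEVERITY_POLICY = {
    1: {"page": ["on-call", "eng-lead", "exec-on-call"],
        "update_every": timedelta(minutes=15)},
    2: {"page": ["on-call", "eng-lead"],
        "update_every": timedelta(minutes=15)},
    3: {"page": ["on-call"],
        "update_every": timedelta(hours=1)},
}
```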
Rollback as the first tool
When in doubt, roll back. Not every incident is caused by the last deploy, but a high fraction are. If you deployed in the last few hours and the service degraded, rollback is the fastest path to a stable state that gives you room to diagnose.
Resist the urge to "just fix it forward." A targeted hotfix pushed under pressure, without full test coverage, creates the next incident.
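The heuristic fits in a few lines of deploy tooling. A sketch; the inputs would come from your deploy log and metrics, and the two-hour window and 5% error-rate threshold are illustrative, not tuned values.

```python
from datetime import timedelta

RECENT = timedelta(hours=2)  # "deployed in the last few hours"
DEGRADED = 0.05              # error-rate threshold; tune per service


def first_move(last_deploy_age: timedelta, error_rate: float) -> str:
    """Pick the stabilizing action before anyone opens a debugger."""
    if last_deploy_age <= RECENT and error_rate >= DEGRADED:
        # Highest-probability cause, and the fastest path back to stable.
        return "rollback"
    # No recent deploy to blame: mitigate (shed load, fail over), then investigate.
    return "mitigate"
```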
Communication cadence
Silence is uncertainty. Uncertainty creates more pages, more DMs, more interrupt pressure on the people trying to fix things.
A reliable cadence:
- Within 5 minutes of declaration: initial status page acknowledgement. "We are investigating elevated error rates."
- Every 15 minutes during Sev 1/2: update, even if it's "Still investigating, no change to ETA."
- On resolution: "Service restored at HH:MM UTC. Full postmortem in 48 hours."
Short, declarative, factual. No speculation about cause in public updates until confirmed.
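The cadence is mechanical enough to automate so Comms never drifts. A sketch; `post_update` and `is_resolved` stand in for your status-page client and incident state, which this document doesn't specify.

```python
import time
from datetime import timedelta


def run_cadence(incident, post_update, is_resolved,
                interval=timedelta(minutes=15)):
    """Post an update every interval until resolution, even if it is
    'Still investigating, no change to ETA.'"""
    minutes = int(interval.total_seconds() // 60)
    # Initial acknowledgement, within 5 minutes of declaration.
    post_update("We are investigating elevated error rates.")
    while not is_resolved(incident):
        time.sleep(interval.total_seconds())
        post_update(f"Still investigating. Next update in {minutes} minutes.")
    post_update("Service restored. Full postmortem to follow within 48 hours.")
```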
The postmortem
Every Sev 1 and 2 gets a postmortem within 48 hours. The goal is learning, not blame. The questions:
- What happened, and when? (Timeline with timestamps)
- Why did it happen? (Root cause, contributing factors)
- Why didn't we catch it faster? (Detection gap)
- What are we changing? (Action items with owners and deadlines)
Blameless postmortems focus on system failures, not individual failures. A person made a mistake because a process, tool, or safeguard let them. Fix the system.
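The four questions map directly onto a document skeleton the Scribe's timeline drops into. A sketch reusing the `Incident` record from earlier; the markdown layout is one convention among many.

```python
def postmortem_template(incident) -> str:
    """Skeleton answering the four questions; filled in within 48 hours."""
    return f"""# Postmortem: {incident.summary} (Sev {incident.severity})

## What happened, and when?
<timeline with timestamps, from the Scribe's record>

## Why did it happen?
<root cause and contributing factors -- system failures, not names>

## Why didn't we catch it faster?
<detection gap: missing alert, noisy dashboard, slow page>

## What are we changing?
| Action item | Owner | Deadline |
|---|---|---|
"""
```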
Tools in the wild
Five tools:
- Slack-native incident response: declare, assign roles, run timelines, auto-generate postmortems.
- Incident management with runbooks, retrospectives, and status-page integration.
- Slack-first incident tooling with strong workflow automation and Jira/Linear bridges.
- Atlassian's hosted public status pages; the standard for customer-facing incident comms.
- Incident analysis platform: turns Slack incident chatter into structured retros.