Incident Response
Detect → contain → eradicate → recover → learn. Don't power off the box.
The pager goes off at 3am. A GuardDuty alert says one of your IAM users is making API calls from an IP in a country no one on your team is in. The instinct is to start typing — revoke keys, kill instances, run scripts. The instinct is wrong.
Effective incident response is a process: the same one, every time, learned by rote so you can execute it under pressure, with degraded judgment, at 3am. NIST SP 800-61 codifies the phases. This lesson is the working version.
The phases
Preparation (always-on) → Detection (alert fires) → Containment (stop the bleed) → Eradication (kick them out) → Recovery (back to green) → Lessons (post-mortem)
Each phase has a clear goal. Skipping or merging them is how incidents become disasters.
1. Preparation (the part nobody respects)
The IR work that matters most happens before the incident. Specifically:
- A runbook for each common incident class — credential leak, ransomware-on-laptop, AWS-key compromise, prod-DB-access from non-prod. The runbook tells the on-call exactly what to do, in numbered steps, with screenshots.
- An out-of-band communication channel: a separate Slack workspace, Signal group, or PagerDuty conference bridge that doesn't depend on the system being attacked.
- A paging tree — who do we wake up, what's the escalation, who has authority to take prod down.
- Pre-baked credentials for incident response — read-only roles in every account so investigators don't have to ask for permissions during the incident.
- A retainer with an IR firm for major incidents (Mandiant, CrowdStrike, Volexity). Sign before you need them; you can't negotiate fees during a breach.
- Tabletop exercises every quarter. Run a simulated ransomware scenario with the on-call team. Find what's broken in calm conditions.
If your organization can't tell you "where do I find the runbook for X right now," your preparation is broken. Fix that on Monday.
2. Detection
The alert fires. You believe an incident is in progress. The first 5 minutes set the tone:
- Timestamp everything. Open a fresh incident document with the trigger time. Every action, every observation, every call gets a timestamp.
- Do not panic-act. Resist the urge to immediately revoke or shut down. Most such actions destroy evidence. Do read-only investigation first.
- Page the IR coordinator. One person owns the incident from this point forward. They route work, they take decisions, they're the single point of communication.
- Open a clean comms channel. Assume your normal Slack/email could be observed. Spin up a sec-incident-NN channel restricted to responders.
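The "timestamp everything" habit is easier to keep when logging costs one line. A minimal sketch, assuming an in-memory log is enough for illustration; the `IncidentLog` class and its field names are hypothetical, and a real incident document would live somewhere durable and access-controlled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Append-only, timestamped record of actions and observations."""

    incident_id: str
    entries: list = field(default_factory=list)

    def note(self, who, what, when=None):
        # Normalize to UTC so responders in different time zones agree.
        # A naive `when` is interpreted as local time by astimezone().
        when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
        line = f"{when.strftime('%Y-%m-%dT%H:%M:%SZ')} [{who}] {what}"
        self.entries.append(line)
        return line


log = IncidentLog("sec-incident-23")
log.note("on-call", "GuardDuty alert acknowledged; read-only review started")
```

Each call to `note` returns the formatted line, so it can be pasted straight into the responder channel as well as kept in the log.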
3. Containment — the most misunderstood phase
The goal of containment is stopping the bleeding, not removing the attacker. Specifically:
- Cut off the attacker's access (network ACL deny rules or tightened security-group rules to block their IP; security groups only allow, they cannot deny; the AWSDenyAll managed policy on the principal).
- Isolate compromised hosts from the rest of the network (don't power them off — see below).
- Halt the attack's progression. Pause CI deploys. Lock the IAM user. Disable the leaked key.
You are not yet trying to figure out what they did or how they got in. That's the next phase. Containment is "limit further damage now."
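For a leaked IAM user key, the containment step can be sketched as below. This assumes `iam` behaves like `boto3.client("iam")`; the function name `contain_iam_user` is hypothetical. Keys are set to Inactive rather than deleted so the key ID remains available when correlating CloudTrail events.

```python
DENY_ALL_ARN = "arn:aws:iam::aws:policy/AWSDenyAll"  # AWS managed policy


def contain_iam_user(iam, user_name):
    """Block an IAM user and deactivate (not delete) its access keys.

    `iam` is assumed to expose the same methods as boto3.client("iam").
    Returns the access key IDs that were switched to Inactive.
    """
    # An explicit deny overrides every allow attached to the user.
    iam.attach_user_policy(UserName=user_name, PolicyArn=DENY_ALL_ARN)

    deactivated = []
    keys = iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]
    for key in keys:
        if key["Status"] == "Active":
            iam.update_access_key(
                UserName=user_name,
                AccessKeyId=key["AccessKeyId"],
                Status="Inactive",
            )
            deactivated.append(key["AccessKeyId"])
    return deactivated
```

With boto3 installed and responder credentials in place, a call would look like `contain_iam_user(boto3.client("iam"), "ci-deploy-bot")`. Passing the client in keeps the function easy to rehearse against a stub during a tabletop exercise.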
The single biggest mistake here is rebooting or powering off the compromised box. Either one:
- Destroys volatile memory (RAM-resident malware, running attacker processes).
- Often clears /tmp, killing artifacts.
- Can break persistence mechanisms in ways that hide the attack vector.
Instead:
- Network-isolate the host (security group / VLAN / NIC down at hypervisor).
- Capture memory (avml on Linux, Belkasoft RAM Capturer on Windows).
- Snapshot the disk via your cloud's API.
- Hash everything; ship to a write-once evidence store in a separate account.
- Then wipe and rebuild — but only after eradication is complete.
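The "hash everything" step can be sketched with the standard library alone. The function names and the idea of a per-directory manifest are illustrative assumptions; the point is that each artifact gets a SHA-256 recorded at capture time, so later copies can be checked against it.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir):
    """Map each file under evidence_dir (relative POSIX path) to its SHA-256."""
    root = Path(evidence_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Store the manifest alongside the evidence in the write-once store and record its own hash in the incident document, so the chain can be verified from the timeline.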
4. Eradication
Now you remove the attacker's footholds. Concrete actions:
- Identify and delete every persistence mechanism: cron jobs, systemd units, startup scripts, IAM users, OAuth tokens, SSH keys, root-CA additions.
- Rotate every secret the attacker could have read. ALL of them. If you can't tell what they accessed, rotate everything in scope.
- Identify every system the attacker touched (lateral movement) and apply the same treatment to each.
- For account compromise: revoke every session token, kill all OAuth grants, force-password-reset, force-MFA-re-enroll.
This is where IR firms earn their fees. Eradication done halfway = you'll be back here in two weeks.
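One narrow slice of the persistence hunt can be sketched in code: listing credentials that were created during the compromise window. The record shape (`id`, `created`) is hypothetical; in practice the data would come from an IAM inventory, an identity provider export, or similar. A hit is a lead for manual review, not proof, and this check does not catch existing credentials the attacker modified.

```python
from datetime import datetime, timezone


def created_in_window(credentials, window_start, window_end):
    """Return IDs of credentials created inside the window (inclusive).

    `credentials` is a list of dicts with hypothetical keys "id" and
    "created" (a timezone-aware datetime).
    """
    return [
        c["id"]
        for c in credentials
        if window_start <= c["created"] <= window_end
    ]


inventory = [
    {"id": "ssh-key-old", "created": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"id": "iam-user-svc2", "created": datetime(2024, 3, 2, 4, 0, tzinfo=timezone.utc)},
]
suspects = created_in_window(
    inventory,
    datetime(2024, 3, 1, tzinfo=timezone.utc),
    datetime(2024, 3, 3, tzinfo=timezone.utc),
)
```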
5. Recovery
Rebuild from clean state. Specifically:
- Restore from backups taken before the compromise window. (You did keep backups that reach back past the start of the compromise window, right? You verified the backups weren't altered, right?)
- Re-issue credentials for restored systems.
- Re-onboard users with new factors.
- Monitor for re-compromise — the attacker may try the same vector again. Your detection is now sharper for it.
- Resume normal operations only with explicit sign-off from the IR coordinator.
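Choosing the restore point reduces to "newest backup strictly older than the start of the compromise window". A minimal sketch, assuming backups are available as `(name, taken_at)` pairs; the names are made up, and integrity verification of the chosen backup still has to happen separately.

```python
from datetime import datetime, timezone


def last_clean_backup(backups, compromise_start):
    """Pick the newest backup taken strictly before the compromise began.

    `backups` is a list of (name, taken_at) tuples with timezone-aware
    datetimes. Returns None when no backup predates the window.
    """
    candidates = [b for b in backups if b[1] < compromise_start]
    return max(candidates, key=lambda b: b[1], default=None)


backups = [
    ("nightly-0301", datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc)),
    ("nightly-0302", datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc)),
    ("nightly-0303", datetime(2024, 3, 3, 2, 0, tzinfo=timezone.utc)),
]
choice = last_clean_backup(backups, datetime(2024, 3, 2, 18, 0, tzinfo=timezone.utc))
```

A `None` result is the case the parenthetical in the first bullet warns about: retention was shorter than the attacker's dwell time.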
6. Lessons learned (the post-mortem)
Within 5 business days of resolution:
- A blameless retrospective document: timeline, contributing factors, what worked, what didn't, what changes prevent the next one.
- Concrete preventive actions assigned to owners with due dates. Not "we should improve detection" — "T. Smith adds GuardDuty IAM-anomaly alert by Q2; J. Lee tightens the s3:* policy by Q2."
- Share with the broader org. Many companies also share with security peers, publicly or under NDA; the public write-ups improve the industry.
Do not skip this phase. The organisations that learn from incidents stop being interesting to attackers.
What to never do
- Power off a compromised host. Capture memory and disk first.
- Wipe and reinstall before forensics. You destroy evidence; you don't know how they got in.
- Use the compromised system to communicate about the incident. Out-of-band, always.
- Argue about whether to call it an incident. If anyone is asking the question, declare. Downgrading is cheap; missing the window is expensive.
- Hide behind legal. Legal will help write your disclosure; they should not gate your technical response.
- Skip the post-mortem because "we know what happened." You don't, and the team won't fix it.
Calling outside help
The IR ladder, in order:
- Internal IR / SecOps team. If you have one.
- Retainer firm (Mandiant, CrowdStrike, Volexity, Stroz Friedberg). Pre-negotiated, available 24/7 for retainer customers; the cheapest possible insurance.
- Cloud provider security teams. AWS Incident Response, GCP Incident Response — built-in for tickets above a certain severity.
- Law enforcement: FBI in the US (especially for ransomware); in the EU, your national police, who coordinate with Europol. They're better than reputation suggests; they can request takedowns and trace funds.
- Regulators: notification is required by law for some breach types (the SEC rule requiring disclosure within four business days of determining an incident is material, the GDPR 72-hour rule for personal-data breaches, HIPAA, etc.). Talk to legal early about disclosure obligations.
Don't try to handle a major breach alone if you have the option. The cost of getting external help is dwarfed by the cost of getting the response wrong.
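The disclosure clocks mentioned above are simple arithmetic, which makes them worth computing the moment the incident is declared. A rough sketch under stated assumptions: the GDPR clock starts when the organization becomes aware of the breach, the SEC clock starts when materiality is determined, and business days skip weekends only (public holidays are ignored here). Legal counsel decides the actual start times.

```python
from datetime import datetime, timedelta, timezone


def gdpr_deadline(aware_at):
    """72 hours after the organization became aware of the breach."""
    return aware_at + timedelta(hours=72)


def add_business_days(start, days):
    """Advance by `days` weekdays, skipping Saturday and Sunday only."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def sec_deadline(materiality_determined_at):
    """Four business days after the materiality determination."""
    return add_business_days(materiality_determined_at, 4)


declared = datetime(2024, 1, 4, 3, 0, tzinfo=timezone.utc)  # a Thursday
print(gdpr_deadline(declared).isoformat())  # 2024-01-07T03:00:00+00:00
print(sec_deadline(declared).date())        # 2024-01-10
```

Writing both deadlines into the incident document at T+0 keeps them visible to the IR coordinator while the technical work is still in progress.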
A typical 24-hour incident timeline
T+0:00 GuardDuty alert: anomalous IAM API calls from unusual geo
T+0:02 On-call paged
T+0:05 IR coordinator declares; #sec-incident-23 channel opened
T+0:08 Read-only CloudTrail review begins
T+0:15 Confirmed: leaked access key in use; principal: ci-deploy-bot
T+0:20 Containment: AttachUserPolicy AWSDenyAll → ci-deploy-bot
T+0:22 EC2 instances launched by the principal in last 4h identified, isolated
T+0:35 Memory + disk snapshots of suspected hosts initiated
T+1:00 Forensic firm engaged via retainer (parallel track)
T+1:30 CloudTrail correlation: key stolen via leaked .env in archived repo
T+2:00 All secrets the bot could have read are rotated
T+3:00 Suspected hosts wiped, rebuilt from gold image
T+5:00 Incident contained; monitoring elevated; SOC handover
T+24:00 Recovery validated; standard alerting resumed
T+5d Post-mortem document published; preventive actions assigned
T+30d Preventive actions closed; tabletop exercise on the same scenario
That shape is what good looks like. The numbers vary; the structure does not.
The cultural prerequisite
Blameless. The post-mortem is about systems and processes, not individuals. The engineer who clicked the phishing link did so because the phishing was good and the training was inadequate. The on-call who took 12 minutes to triage did so because the runbook was three pages of dense prose at 3am. Fix the system; trust the people.
Teams that punish individuals for incidents end up with hidden incidents — quiet rotations, suppressed alerts, unreported anomalies. That's a far worse failure mode. Make it psychologically safe to report and to be the person at the center of an incident, and you'll learn from every one. Make it punishing, and you'll only learn from the catastrophic ones.
Tools in the wild
- Service: On-call paging and runbook execution. The 'who do we wake up' layer of IR.
- Volatility (CLI, free tier): Memory-forensics framework; analyzes RAM dumps for malware artifacts.
- AVML (CLI, free tier): Live Linux memory acquisition; captures RAM without rebooting.
- GRR Rapid Response (service, free tier): Live forensics and incident response across fleets; Google's open-source tool.
- Service: Retainer firms; bring in expertise during big incidents.
- Service: Built-in IR triage for AWS; graph-based investigation of CloudTrail + VPC flow logs.