Change Management
Schedule risky work around the freezes — with rollback in hand.
Most production incidents trace back to a change someone made. The job of change management isn't to prevent change — it's to land risky work at a time, in an order, and with a rollback path that makes failures recoverable instead of catastrophic.
This level gives you a week (168 hours) and a list of changes — some of them dependent on each other, all of them fighting for the quiet slots around the blackouts. Pack the calendar.
Analogy
Think of a change calendar like a surgery schedule. You don't do the big bowel resection on Christmas Eve; you don't start a procedure the hospital knows they'll have to reverse before rounds tomorrow; and you never open a patient without a clear plan for closing them back up.
Same three rules for infrastructure:
- When — schedule around freezes. Quarter close, Black Friday, wedding-day demos.
- What order — finish the migration before the app deploy; flush the cache only after the new code is live.
- Rollback — every change has a pre-written, tested way to undo itself, in less time than the change itself took.
The three dimensions of a change
| Dimension | Question it answers | Example |
|---|---|---|
| Blast radius | If this goes wrong, who is affected? | low = one non-prod service, high = the payment path |
| Reversibility | How long would it take to roll back? | DB drop column: hours. Feature flag toggle: seconds. |
| Coordination | Who else needs to know? | Solo hotfix, pair review, full CAB approval |
The three dimensions collapse into one scheduling decision: high-blast + slow-rollback + cross-team coordination = land on a Tuesday morning with a war room; low-blast + instant-rollback + solo = ship it now.
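As a rough sketch of how that collapse might look in code, here is a hypothetical scorer (the field names and thresholds are illustrative, not the playground's actual rules):

```python
from dataclasses import dataclass

@dataclass
class Change:
    name: str
    blast_radius: str        # "low" | "medium" | "high"
    rollback_minutes: int    # how long the *tested* rollback takes
    teams_involved: int      # 1 = solo, 2 = pair review, 3+ = cross-team coordination

def scheduling_decision(change: Change) -> str:
    """Collapse blast radius, reversibility and coordination into one call."""
    low_risk = (
        change.blast_radius == "low"
        and change.rollback_minutes <= 1
        and change.teams_involved == 1
    )
    high_risk = (
        change.blast_radius == "high"
        or change.rollback_minutes > 30
        or change.teams_involved >= 3
    )
    if low_risk:
        return "ship it now"
    if high_risk:
        return "Tuesday morning, war room booked, rollback rehearsed"
    return "next quiet change window, with a reviewer"

print(scheduling_decision(Change("payments-schema", "high", 120, 3)))
print(scheduling_decision(Change("copy-tweak", "low", 0, 1)))
```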
Change windows and blackouts
A change window is a pre-agreed stretch of time where risky work is allowed — typically off-peak, with enough headroom before the next peak that rollback (if needed) has slack. A blackout is the inverse: a pre-agreed stretch where no change is allowed.
Classic blackouts:
- Quarter close — finance books are closing; a billing bug compounds.
- Weekend freeze — Friday 5pm → Monday 9am. Don't break the weekend.
- Peak-hour freeze — e-commerce during Black Friday, news during elections.
- Organisational blackouts — all-hands, customer demos, the CEO is at a conference and can't approve emergencies.
The scheduler in the playground respects both: your placement must finish before a blackout starts, or wait until one ends.
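A minimal sketch of that fit check, assuming hour 0 is Monday 00:00 and blackouts are half-open hour intervals (both are assumptions about the playground's representation, not facts about it):

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # (start_hour, end_hour), end exclusive, inside a 168-hour week

def fits(start: int, duration: int, blackouts: List[Interval]) -> bool:
    """True if the change finishes inside the week and never overlaps a blackout."""
    end = start + duration
    if end > 168:
        return False
    for b_start, b_end in blackouts:
        if start < b_end and b_start < end:   # standard interval-overlap test
            return False
    return True

weekend_freeze = (113, 168)                                      # Friday 17:00 to end of week
print(fits(start=34, duration=3, blackouts=[weekend_freeze]))    # Tuesday 10:00 -> True
print(fits(start=112, duration=4, blackouts=[weekend_freeze]))   # runs into the freeze -> False
```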
Dependency DAG
Changes often depend on each other in a fixed order:
db-migrate → app-deploy → cache-flush
The migration adds columns. The app deploy reads them. The cache flush evicts stale rows keyed the old way. Run them out of order and you get a five-minute outage at best, data corruption at worst.
Topologically sorting a DAG is a classic algorithms problem (Kahn's algorithm, O(V+E)). The ITIL-shaped wrinkle is that nodes also carry duration, blackout constraints, and blast-radius metadata — so the "right" order is the one that respects the DAG, fits the windows, and front-loads the high-risk work into the slots with the most rollback room.
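Here is a sketch of just the ordering step, Kahn's algorithm over the example above; it assumes every change appears as a key in the map and deliberately ignores durations, windows, and blast radius:

```python
from collections import deque

def topo_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: deps maps each change to the changes it depends on."""
    indegree = {change: len(pre) for change, pre in deps.items()}
    dependants = {change: [] for change in deps}
    for change, pre in deps.items():
        for p in pre:
            dependants[p].append(change)

    queue = deque(change for change, d in indegree.items() if d == 0)
    order = []
    while queue:
        change = queue.popleft()
        order.append(change)
        for nxt in dependants[change]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(deps):
        raise ValueError("cycle detected: this change set has no valid order")
    return order

deps = {
    "db-migrate": [],
    "app-deploy": ["db-migrate"],
    "cache-flush": ["app-deploy"],
}
print(topo_order(deps))   # ['db-migrate', 'app-deploy', 'cache-flush']
```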
Rollback planning
A change without a rollback plan isn't a change — it's a one-way door. Before touching prod, answer:
- How do I know this is broken? — specific signal (latency, error rate, customer report), not "I have a bad feeling."
- How do I undo it? — specific command, runbook, PR revert, DB migration-down. Tested on staging, not "should work in theory."
- How long does rollback take? — shorter than the change itself, ideally instant.
- Who decides? — one named person with the authority to pull the trigger without polling five others.
Reversible-by-design > carefully-monitored-irreversible. Feature flags, blue/green deploys, gradual rollouts, read replicas you can promote — these aren't nice-to-haves; they are the difference between a 2-minute blip and a 2-hour war room.
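One way to make the four questions above hard to skip is to encode them as required fields with a pre-flight check. This is a hypothetical sketch, not any real tool's schema, and the helm command is only an illustrative undo procedure:

```python
from dataclasses import dataclass

@dataclass
class RollbackPlan:
    failure_signal: str          # the specific metric/alert, not "a bad feeling"
    undo_procedure: str          # exact command, runbook, or revert PR
    rollback_minutes: int        # measured on staging, not guessed
    decision_owner: str          # the one named person who can pull the trigger
    last_rehearsed_days_ago: int

def blockers(change_minutes: int, plan: RollbackPlan) -> list[str]:
    """Reasons this change is not yet safe to schedule (empty list = go)."""
    problems = []
    if plan.rollback_minutes > change_minutes:
        problems.append("rollback takes longer than the change itself")
    if plan.last_rehearsed_days_ago > 90:
        problems.append("rollback not rehearsed this quarter; assume it is broken")
    return problems

plan = RollbackPlan(
    failure_signal="checkout p99 latency > 800ms for 5 minutes",
    undo_procedure="helm rollback checkout 41",
    rollback_minutes=4,
    decision_owner="on-call SRE",
    last_rehearsed_days_ago=120,
)
print(blockers(change_minutes=30, plan=plan))
# ['rollback not rehearsed this quarter; assume it is broken']
```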
CAB at startup scale
ITIL's Change Advisory Board is a committee that reviews and approves changes. At a 50,000-person enterprise, it meets weekly and gatekeeps every production touch. At a 20-person startup, a full CAB is absurd overhead — but the function a CAB provides still matters:
- Pre-mortem — someone other than the author looks for failure modes.
- Blackout awareness — "oh, that's during the demo."
- Dependency reality check — "you know the billing team is also shipping tonight?"
- Rollback review — "walk me through the undo plan."
Startup-scale equivalents: a #changes Slack channel with a standard template, a brief pair review before merging high-risk PRs, a Monday 09:30 "what's shipping this week" standup. Same function, a tenth the ceremony.
Change vs incident
A change is planned, authorised, reversible-by-design. An incident is an unplanned break — it finds you, not the other way around.
The relationship is closer than teams often admit:
- A badly-planned change becomes an incident — the 2am page is the tell.
- An incident frequently requires an emergency change to resolve (hotfix, rollback, scale-up).
- The post-incident review should produce better change controls — if the same class of change keeps breaking, the process failed, not just the human.
Emergency changes get their own lightweight approval path (one on-call + one sanity-checker), but they should be rare. A team that's always shipping "emergency changes" has a planning problem, not an execution problem.
Real tools you'll see
- Jira Service Management (Change) — ticket-driven CAB, calendar view, integrates with Bitbucket/GitHub for automated "did this actually ship" updates.
- ServiceNow Change Management — the enterprise default; full CAB workflows, risk scoring, integration with the CMDB.
- Freshservice Changes — mid-market alternative; approval chains + calendar + impact analysis.
- PagerDuty Change Events — lightweight "record that a deploy happened" feed; pairs with incident timelines to answer "what changed just before this page?"
- GitHub Environments + required reviewers — the minimum-viable CAB for engineering-led orgs.
Why changes fail (the root-cause spectrum)
When a change goes sideways, post-incident analysis almost always points at one of these:
- Incomplete dependency map — "we didn't know Service Y talked to Service X."
- No rollback path — "the migration was irreversible and we hadn't tested the forward-fix."
- Wrong window — "we ran it at 4pm on a Friday."
- Silent blast-radius creep — "the shared library change affected 14 services, not just the one we tested."
- Approval theatre — "the CAB approved it but nobody actually reviewed the diff."
- Drift between prod and staging — "it worked in staging."
Gotchas
- Emergency-change abuse — if "emergency" is the only way to ship fast, the normal process is broken. Fix the process, don't route everything through the emergency lane.
- Change-shaped incidents — a deploy that technically succeeded but degraded latency by 3x is still a change-caused incident. "The pipeline went green" is not a success criterion.
- Invisible dependencies — cron jobs, cached IAM policies, DNS TTLs, warm connection pools. Your DAG is incomplete until you've asked "what might break when this runs today that wouldn't have broken yesterday?"
- Rollback rot — rollback plans decay. If you haven't rehearsed the rollback in the last quarter, assume it's broken.
Playground
The calendar on the next tab gives you a week, a set of changes, and a set of blackouts. Pick a change, click an hour slot, watch the live conflict report. The Auto-schedule button runs the reference packer — a good comparison once you've tried it yourself.
Visualizer
The Change Dependency Graph panel draws the same change set as a DAG, colour-coded by blast radius, with dashed rings on any change whose slot would collide with a blackout.