it · level 10

Change Management

Schedule risky work around the freezes — with rollback in hand.

170 XP

Most production incidents trace back to a change someone made. The job of change management isn't to prevent change — it's to land risky work at a time, in an order, and with a rollback path that makes failures recoverable instead of catastrophic.

This level gives you a week (168 hours) and a list of changes — some of them dependent on each other, all of them fighting for the quiet slots around the blackouts. Pack the calendar.

Analogy

Think of a change calendar like a surgery schedule. You don't do the big bowel resection on Christmas Eve; you don't start a procedure the hospital knows they'll have to reverse before rounds tomorrow; and you never open a patient without a clear plan for closing them back up.

Same three rules for infrastructure:

  • When — schedule around freezes. Quarter close, Black Friday, wedding-day demos.
  • What order — finish the migration before the app deploy; flush the cache only after the new code is live.
  • Rollback — every change has a pre-written, tested way to undo itself, in less time than the change itself took.

The three dimensions of a change

Dimension     | Question it answers                  | Example
Blast radius  | If this goes wrong, who is affected? | low = one non-prod service; high = the payment path
Reversibility | How long would it take to roll back? | DB drop column: hours; feature-flag toggle: seconds
Coordination  | Who else needs to know?              | solo hotfix; pair review; full CAB approval

The three dimensions collapse into one scheduling decision: high-blast + slow-rollback + cross-team coordination = land on a Tuesday morning with a war room; low-blast + instant-rollback + solo = ship it now.
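That collapse can be sketched as a tiny triage function. Everything here — the names, the five-minute rollback threshold, the two outcomes — is illustrative, not a real policy:

```python
from enum import Enum

class Blast(Enum):
    LOW = 1    # e.g. one non-prod service
    HIGH = 2   # e.g. the payment path

def schedule_decision(blast: Blast, rollback_seconds: int, solo: bool) -> str:
    """Collapse blast radius, reversibility, and coordination into one call.

    Illustrative thresholds: anything high-blast, slower than ~5 minutes to
    undo, or needing more than one person gets a planned window.
    """
    if blast is Blast.HIGH or rollback_seconds > 300 or not solo:
        return "planned window + war room"
    return "ship now"
```

A low-blast, instantly-reversible solo change (`schedule_decision(Blast.LOW, 5, True)`) ships now; flip any one dimension and it lands in a window.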

Change windows and blackouts

A change window is a pre-agreed stretch of time where risky work is allowed — typically off-peak, with enough headroom before the next peak that rollback (if needed) has slack. A blackout is the inverse: a pre-agreed stretch where no change is allowed.

Classic blackouts:

  • Quarter close — finance books are closing; a billing bug compounds.
  • Weekend freeze — Friday 5pm → Monday 9am. Don't break the weekend.
  • Peak-hour freeze — e-commerce during Black Friday, news during elections.
  • Organisational blackouts — all-hands, customer demos, the CEO is at a conference and can't approve emergencies.

The scheduler in the playground respects both: your placement must finish before a blackout starts, or wait until one ends.
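The collision check itself is a small interval test. A minimal sketch, assuming hours are indexed 0–167 across the week and blackouts are half-open `(begin, end)` pairs — the playground's internal representation may differ:

```python
def fits_window(start: int, duration: int,
                blackouts: list[tuple[int, int]]) -> bool:
    """True if a change occupying [start, start + duration) clears the week
    and every blackout. Intervals are half-open, so a change may end exactly
    when a blackout begins."""
    end = start + duration
    if end > 168:  # must finish inside the 168-hour week
        return False
    # Two half-open intervals overlap iff each starts before the other ends.
    return all(end <= b_start or start >= b_end for b_start, b_end in blackouts)
```

So a 10-hour change starting at hour 30 clears a blackout at `(40, 50)` — it ends exactly as the blackout begins — while the same change starting at hour 35 collides.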

Dependency DAG

Changes often depend on each other in a fixed order:

db-migrate  →  app-deploy  →  cache-flush

The migration adds columns. The app deploy reads them. The cache flush evicts stale rows keyed the old way. Run them out of order and you get a five-minute outage at best, data corruption at worst.

Topologically sorting a DAG is a classic algorithms problem (Kahn's algorithm, O(V+E)). The ITIL-shaped wrinkle is that nodes also carry duration, blackout constraints, and blast-radius metadata — so the "right" order is the one that respects the DAG, fits the windows, and front-loads the high-risk work into slots with the most rollback room.
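Kahn's algorithm itself is short. A sketch, with each change mapped to its prerequisites (the function name and input shape are mine, not the playground's):

```python
from collections import deque

def topo_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm, O(V+E). deps maps each change to its prerequisites."""
    indegree = {change: len(pre) for change, pre in deps.items()}
    dependents: dict[str, list[str]] = {change: [] for change in deps}
    for change, pre in deps.items():
        for p in pre:
            dependents[p].append(change)
    # Start with every change that has no unmet prerequisite.
    ready = deque(c for c, d in indegree.items() if d == 0)
    order: list[str] = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for nxt in dependents[c]:   # completing c unblocks its dependents
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected - not a DAG")
    return order
```

On the chain above, `topo_order({"db-migrate": [], "app-deploy": ["db-migrate"], "cache-flush": ["app-deploy"]})` yields `["db-migrate", "app-deploy", "cache-flush"]`.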

Rollback planning

A change without a rollback plan isn't a change — it's a one-way door. Before touching prod, answer:

  1. How do I know this is broken? — specific signal (latency, error rate, customer report), not "I have a bad feeling."
  2. How do I undo it? — specific command, runbook, PR revert, DB migration-down. Tested on staging, not "should work in theory."
  3. How long does rollback take? — shorter than the change itself, ideally instant.
  4. Who decides? — one named person with the authority to pull the trigger without polling five others.
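The four questions make a natural pre-flight record. A hedged sketch — the field names and the approval rule are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class RollbackPlan:
    failure_signal: str    # 1. the specific signal that means "broken"
    undo_command: str      # 2. the staging-tested undo (command, revert, runbook)
    rollback_minutes: int  # 3. how long the undo takes
    owner: str             # 4. the one named person who decides

def approve(change_minutes: int, plan: RollbackPlan) -> bool:
    """Gate on the four answers: every field filled in, and the rollback
    no slower than the change itself (rule 3 above)."""
    return (bool(plan.failure_signal) and bool(plan.undo_command)
            and bool(plan.owner)
            and plan.rollback_minutes <= change_minutes)
```

A 30-minute change with a 5-minute tested rollback passes; the same plan attached to a 2-minute change fails, because the undo would take longer than the change it protects.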

Reversible-by-design > carefully-monitored-irreversible. Feature flags, blue/green deploys, gradual rollouts, read replicas you can promote — these aren't nice-to-haves; they are the difference between a 2-minute blip and a 2-hour war room.

CAB at startup scale

ITIL's Change Advisory Board is a committee that reviews and approves changes. At a 50,000-person enterprise, it meets weekly and gatekeeps every production touch. At a 20-person startup, a full CAB is absurd overhead — but the function a CAB provides still matters:

  • Pre-mortem — someone other than the author looks for failure modes.
  • Blackout awareness — "oh, that's during the demo."
  • Dependency reality check — "you know the billing team is also shipping tonight?"
  • Rollback review — "walk me through the undo plan."

Startup-scale equivalents: a #changes Slack channel with a standard template, a brief pair review before merging high-risk PRs, a Monday 09:30 "what's shipping this week" standup. Same function, a tenth the ceremony.

Change vs incident

A change is planned, authorised, reversible-by-design. An incident is an unplanned break — you find out about it, not the other way around.

The relationship is closer than teams often admit:

  • A badly-planned change becomes an incident — the 2am page is the tell.
  • An incident frequently requires an emergency change to resolve (hotfix, rollback, scale-up).
  • The post-incident review should produce better change controls — if the same class of change keeps breaking, the process failed, not just the human.

Emergency changes get their own lightweight approval path (one on-call + one sanity-checker), but they should be rare. A team that's always shipping "emergency changes" has a planning problem, not an execution problem.

Real tools you'll see

  • Jira Service Management (Change) — ticket-driven CAB, calendar view, integrates with Bitbucket/GitHub for automated "did this actually ship" updates.
  • ServiceNow Change Management — the enterprise default; full CAB workflows, risk scoring, integration with the CMDB.
  • Freshservice Changes — mid-market alternative; approval chains + calendar + impact analysis.
  • PagerDuty Change Events — lightweight "record that a deploy happened" feed; pairs with incident timelines to answer "what changed just before this page?"
  • GitHub Environments + required reviewers — the minimum-viable CAB for engineering-led orgs.

Why changes fail (the root-cause spectrum)

When a change goes sideways, post-incident analysis almost always points at one of these:

  1. Incomplete dependency map — "we didn't know Service Y talked to Service X."
  2. No rollback path — "the migration was irreversible and we hadn't tested the forward-fix."
  3. Wrong window — "we ran it at 4pm on a Friday."
  4. Silent blast-radius creep — "the shared library change affected 14 services, not just the one we tested."
  5. Approval theatre — "the CAB approved it but nobody actually reviewed the diff."
  6. Drift between prod and staging — "it worked in staging."

Gotchas

  • Emergency-change abuse — if "emergency" is the only way to ship fast, the normal process is broken. Fix the process, don't route everything through the emergency lane.
  • Change-shaped incidents — a deploy that technically succeeded but degraded latency by 3x is still a change-caused incident. "The pipeline went green" is not a success criterion.
  • Invisible dependencies — cron jobs, cached IAM policies, DNS TTLs, warm connection pools. Your DAG is incomplete until you've asked "what might break when this runs today that wouldn't have broken yesterday?"
  • Rollback rot — rollback plans decay. If you haven't rehearsed the rollback in the last quarter, assume it's broken.

Playground

The calendar on the next tab gives you a week, a set of changes, and a set of blackouts. Pick a change, click an hour slot, watch the live conflict report. The Auto-schedule button runs the reference packer — a good comparison once you've tried it yourself.
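As a point of comparison, here is a toy greedy packer — not the reference implementation. It assumes the changes arrive already in dependency order as `(name, duration)` pairs, chains each change after its predecessor, and skips forward past any blackout it collides with:

```python
def pack(changes: list[tuple[str, int]],
         blackouts: list[tuple[int, int]],
         week_hours: int = 168) -> dict[str, int]:
    """Place each change at the earliest start hour that avoids every
    blackout and begins no earlier than the previous change's end.
    Hours are 0-indexed; intervals are half-open."""
    schedule: dict[str, int] = {}
    cursor = 0  # earliest hour the next change may start
    for name, duration in changes:
        start = cursor
        while start + duration <= week_hours:
            end = start + duration
            if all(end <= b0 or start >= b1 for b0, b1 in blackouts):
                break  # found a clear slot
            # Jump past the end of the first blackout we collide with.
            start = min(b1 for b0, b1 in blackouts if start < b1 and end > b0)
        else:
            raise ValueError(f"{name} does not fit this week")
        schedule[name] = start
        cursor = start + duration
    return schedule
```

With a blackout over hours 0–3, the three-change chain from earlier packs to start hours 3, 5, and 6. A real packer would also weigh blast radius and rollback slack when choosing among feasible slots; this sketch only finds the earliest one.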

Visualizer

The Change Dependency Graph panel draws the same change set as a DAG, colour-coded by blast radius, with dashed rings on any change whose slot would collide with a blackout.