it · level 10

Change Management

Schedule risky work around the freezes — with rollback in hand.

170 XP

Most production incidents trace back to a change someone made. The job of change management isn't to prevent change — it's to land risky work at a time, in an order, and with a rollback path that makes failures recoverable instead of catastrophic.

This level gives you a week (168 hours) and a list of changes — some of them dependent on each other, all of them fighting for the quiet slots around the blackouts. Pack the calendar.

Analogy

Think of a change calendar like a surgery schedule. You don't do the big bowel resection on Christmas Eve; you don't start a procedure the hospital knows they'll have to reverse before rounds tomorrow; and you never open a patient without a clear plan for closing them back up.

Same three rules for infrastructure:

  • When — schedule around freezes. Quarter close, Black Friday, wedding-day demos.
  • What order — finish the migration before the app deploy; flush the cache only after the new code is live.
  • Rollback — every change has a pre-written, tested way to undo itself, in less time than the change itself took.

The three dimensions of a change

Dimension     | Question it answers                  | Example
Blast radius  | If this goes wrong, who is affected? | low = one non-prod service; high = the payment path
Reversibility | How long would it take to roll back? | DB drop column: hours; feature-flag toggle: seconds
Coordination  | Who else needs to know?              | solo hotfix; pair review; full CAB approval

The three dimensions collapse into one scheduling decision: high-blast + slow-rollback + cross-team coordination = land on a Tuesday morning with a war room; low-blast + instant-rollback + solo = ship it now.
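That collapse can be sketched as a tiny triage function. Everything here — the names, the five-minute rollback threshold, the two outcomes — is illustrative, not a real policy:

```python
from enum import Enum

class Blast(Enum):
    LOW = 1    # e.g. one non-prod service
    HIGH = 2   # e.g. the payment path

def schedule_decision(blast: Blast, rollback_seconds: int, solo: bool) -> str:
    """Collapse blast radius, reversibility, and coordination into one call.

    Illustrative thresholds: anything high-blast, slower than ~5 minutes to
    undo, or needing more than one person gets a planned window.
    """
    if blast is Blast.HIGH or rollback_seconds > 300 or not solo:
        return "planned window + war room"
    return "ship now"
```

A low-blast, instantly-reversible solo change (`schedule_decision(Blast.LOW, 5, True)`) ships now; flip any one dimension and it lands in a window.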

Change windows and blackouts

A change window is a pre-agreed stretch of time where risky work is allowed — typically off-peak, with enough headroom before the next peak that rollback (if needed) has slack. A blackout is the inverse: a pre-agreed stretch where no change is allowed.

Classic blackouts:

  • Quarter close — finance books are closing; a billing bug compounds.
  • Weekend freeze — Friday 5pm → Monday 9am. Don't break the weekend.
  • Peak-hour freeze — e-commerce during Black Friday, news during elections.
  • Organisational blackouts — all-hands, customer demos, the CEO is at a conference and can't approve emergencies.

The scheduler in the playground respects both: your placement must finish before a blackout starts, or wait until one ends.
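The collision check itself is a small interval test. A minimal sketch, assuming hours are indexed 0–167 across the week and blackouts are half-open `(begin, end)` pairs — the playground's internal representation may differ:

```python
def fits_window(start: int, duration: int,
                blackouts: list[tuple[int, int]]) -> bool:
    """True if a change occupying [start, start + duration) clears the week
    and every blackout. Intervals are half-open, so a change may end exactly
    when a blackout begins."""
    end = start + duration
    if end > 168:  # must finish inside the 168-hour week
        return False
    # Two half-open intervals overlap iff each starts before the other ends.
    return all(end <= b_start or start >= b_end for b_start, b_end in blackouts)
```

So a 10-hour change starting at hour 30 clears a blackout at `(40, 50)` — it ends exactly as the blackout begins — while the same change starting at hour 35 collides.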

Dependency DAG

Changes often depend on each other in a fixed order:

db-migrate  →  app-deploy  →  cache-flush

The migration adds columns. The app deploy reads them. The cache flush evicts stale rows keyed the old way. Run them out of order and you get a five-minute outage at best, data corruption at worst.

Topologically sorting a DAG is a classic algorithms problem (Kahn's algorithm, O(V+E)). The ITIL-shaped wrinkle is that nodes also carry duration, blackout constraints, and blast-radius metadata — so the "right" order is the one that respects the DAG, fits the windows, and front-loads the high-risk work into slots with the most rollback room.
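Kahn's algorithm itself is short. A sketch, with each change mapped to its prerequisites (the function name and input shape are mine, not the playground's):

```python
from collections import deque

def topo_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm, O(V+E). deps maps each change to its prerequisites."""
    indegree = {change: len(pre) for change, pre in deps.items()}
    dependents: dict[str, list[str]] = {change: [] for change in deps}
    for change, pre in deps.items():
        for p in pre:
            dependents[p].append(change)
    # Start with every change that has no unmet prerequisite.
    ready = deque(c for c, d in indegree.items() if d == 0)
    order: list[str] = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for nxt in dependents[c]:   # completing c unblocks its dependents
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected - not a DAG")
    return order
```

On the chain above, `topo_order({"db-migrate": [], "app-deploy": ["db-migrate"], "cache-flush": ["app-deploy"]})` yields `["db-migrate", "app-deploy", "cache-flush"]`.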

Rollback planning

A change without a rollback plan isn't a change — it's a one-way door. Before touching prod, answer:

  1. How do I know this is broken? — specific signal (latency, error rate, customer report), not "I have a bad feeling."
  2. How do I undo it? — specific command, runbook, PR revert, DB migration-down. Tested on staging, not "should work in theory."
  3. How long does rollback take? — shorter than the change itself, ideally instant.
  4. Who decides? — one named person with the authority to pull the trigger without polling five others.
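The four questions make a natural pre-flight record. A hedged sketch — the field names and the approval rule are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class RollbackPlan:
    failure_signal: str    # 1. the specific signal that means "broken"
    undo_command: str      # 2. the staging-tested undo (command, revert, runbook)
    rollback_minutes: int  # 3. how long the undo takes
    owner: str             # 4. the one named person who decides

def approve(change_minutes: int, plan: RollbackPlan) -> bool:
    """Gate on the four answers: every field filled in, and the rollback
    no slower than the change itself (rule 3 above)."""
    return (bool(plan.failure_signal) and bool(plan.undo_command)
            and bool(plan.owner)
            and plan.rollback_minutes <= change_minutes)
```

A 30-minute change with a 5-minute tested rollback passes; the same plan attached to a 2-minute change fails, because the undo would take longer than the change it protects.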

Reversible-by-design > carefully-monitored-irreversible. Feature flags, blue/green deploys, gradual rollouts, read replicas you can promote — these aren't nice-to-haves; they are the difference between a 2-minute blip and a 2-hour war room.

CAB at startup scale

ITIL's Change Advisory Board is a committee that reviews and approves changes. At a 50,000-person enterprise, it meets weekly and gatekeeps every production touch. At a 20-person startup, a full CAB is absurd overhead — but the function a CAB provides still matters:

  • Pre-mortem — someone other than the author looks for failure modes.
  • Blackout awareness — "oh, that's during the demo."
  • Dependency reality check — "you know the billing team is also shipping tonight?"
  • Rollback review — "walk me through the undo plan."

Startup-scale equivalents: a #changes Slack channel with a standard template, a brief pair review before merging high-risk PRs, a Monday 09:30 "what's shipping this week" standup. Same function, a tenth the ceremony.

Change vs incident

A change is planned, authorised, reversible-by-design. An incident is an unplanned break — you find out about it, not the other way around.

The relationship is closer than teams often admit:

  • A badly-planned change becomes an incident — the 2am page is the tell.
  • An incident frequently requires an emergency change to resolve (hotfix, rollback, scale-up).
  • The post-incident review should produce better change controls — if the same class of change keeps breaking, the process failed, not just the human.

Emergency changes get their own lightweight approval path (one on-call + one sanity-checker), but they should be rare. A team that's always shipping "emergency changes" has a planning problem, not an execution problem.

Real tools you'll see

  • Jira Service Management (Change) — ticket-driven CAB, calendar view, integrates with Bitbucket/GitHub for automated "did this actually ship" updates.
  • ServiceNow Change Management — the enterprise default; full CAB workflows, risk scoring, integration with the CMDB.
  • Freshservice Changes — mid-market alternative; approval chains + calendar + impact analysis.
  • PagerDuty Change Events — lightweight "record that a deploy happened" feed; pairs with incident timelines to answer "what changed just before this page?"
  • GitHub Environments + required reviewers — the minimum-viable CAB for engineering-led orgs.

Why changes fail (the root-cause spectrum)

When a change goes sideways, post-incident analysis almost always points at one of these:

  1. Incomplete dependency map — "we didn't know Service Y talked to Service X."
  2. No rollback path — "the migration was irreversible and we hadn't tested the forward-fix."
  3. Wrong window — "we ran it at 4pm on a Friday."
  4. Silent blast-radius creep — "the shared library change affected 14 services, not just the one we tested."
  5. Approval theatre — "the CAB approved it but nobody actually reviewed the diff."
  6. Drift between prod and staging — "it worked in staging."

Gotchas

  • Emergency-change abuse — if "emergency" is the only way to ship fast, the normal process is broken. Fix the process, don't route everything through the emergency lane.
  • Change-shaped incidents — a deploy that technically succeeded but degraded latency by 3x is still a change-caused incident. "The pipeline went green" is not a success criterion.
  • Invisible dependencies — cron jobs, cached IAM policies, DNS TTLs, warm connection pools. Your DAG is incomplete until you've asked "what might break when this runs today that wouldn't have broken yesterday?"
  • Rollback rot — rollback plans decay. If you haven't rehearsed the rollback in the last quarter, assume it's broken.

Playground

The calendar on the next tab gives you a week, a set of changes, and a set of blackouts. Pick a change, click an hour slot, watch the live conflict report. The Auto-schedule button runs the reference packer — a good comparison once you've tried it yourself.
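As a point of comparison, here is a toy greedy packer — not the reference implementation. It assumes the changes arrive already in dependency order as `(name, duration)` pairs, chains each change after its predecessor, and skips forward past any blackout it collides with:

```python
def pack(changes: list[tuple[str, int]],
         blackouts: list[tuple[int, int]],
         week_hours: int = 168) -> dict[str, int]:
    """Place each change at the earliest start hour that avoids every
    blackout and begins no earlier than the previous change's end.
    Hours are 0-indexed; intervals are half-open."""
    schedule: dict[str, int] = {}
    cursor = 0  # earliest hour the next change may start
    for name, duration in changes:
        start = cursor
        while start + duration <= week_hours:
            end = start + duration
            if all(end <= b0 or start >= b1 for b0, b1 in blackouts):
                break  # found a clear slot
            # Jump past the end of the first blackout we collide with.
            start = min(b1 for b0, b1 in blackouts if start < b1 and end > b0)
        else:
            raise ValueError(f"{name} does not fit this week")
        schedule[name] = start
        cursor = start + duration
    return schedule
```

With a blackout over hours 0–3, the three-change chain from earlier packs to start hours 3, 5, and 6. A real packer would also weigh blast radius and rollback slack when choosing among feasible slots; this sketch only finds the earliest one.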

Visualizer

The Change Dependency Graph panel draws the same change set as a DAG, colour-coded by blast radius, with dashed rings on any change whose slot would collide with a blackout.