databases · level 8

Database Migrations

Expand-contract, online schema-change, and blue/green for the schema itself.

200 XP

Database Migrations

Schema changes are the most operationally fraught thing engineers do regularly. The discipline isn't about being clever; it's about being mechanical. Every migration is two versions of the application in disagreement for at least one deploy cycle, and the job is to keep that disagreement non-fatal.

Analogy

Imagine renovating a building while it's still in use. The wrong way: shut the building down for a weekend, knock down the wall, hope it goes well. The right way: build the new wall behind the old one, run all the wiring through both for a while, route people through the new way once it's tested, then take down the old wall. Slower, more steps, but the building never closes — and at any point you can stop and the building is still standing.

The cardinal rule

Every migration step must be independently deployable and independently rollback-safe.

Build that as a hard rule and the rest of the discipline follows. It rules out:

Big-bang migrations that combine schema and code changes.
"Just rename this column" — old code breaks the moment it deploys.
"Drop this column, then update the code" — old code breaks until the new code is fully rolled out.

The pattern that satisfies this rule is expand-contract (also known as parallel change).

Expand-contract

Five steps, deployed independently, none of them require downtime:

Expand: add the new shape alongside the old. (e.g. add a new column.)
Dual-write: writers update both the old and new shape. The new shape is built up via app code, triggers, or a background backfill.
Backfill: populate the new shape for all historical rows.
Cut over reads: readers switch to the new shape. The old shape is still maintained by writers.
Contract: stop dual-writing; drop the old shape.

Each step is a small, reversible deploy. At any point you can stop, and the application still works.

A concrete example

You want to rename users.subscription to users.tier because you've changed what it represents.

-- Step 1 (deploy A — schema): add the new column, nullable.
ALTER TABLE users ADD COLUMN tier text;

-- Step 2 (deploy B — app): every UPDATE/INSERT writes BOTH columns.
--   when a user changes their plan, set both subscription and tier.

-- Step 3 (background): backfill the historical rows in batches.
UPDATE users SET tier = subscription WHERE tier IS NULL AND id BETWEEN 0 AND 10000;
-- ... repeat for the rest of the id range, throttled.

-- Step 4 (deploy C — app): readers consume tier; writers stop populating subscription.

-- Step 5 (deploy D — schema): drop the old column.
ALTER TABLE users DROP COLUMN subscription;

Five small deploys. At every intermediate state, both old and new code work.

Big tables, big locks

ALTER TABLE on a billion-row table can lock writes for hours on MySQL or block reads on some Postgres operations. Two paths:

Postgres: most ALTERs are now metadata-only

PG 11+ made many DDL operations instant for new columns:

ADD COLUMN ... DEFAULT <const> — instant (PG 11+); the default is stored in metadata, not written to every row.
ADD COLUMN (no default) — instant. Use this and backfill in batches.
ALTER COLUMN ... SET NOT NULL — needs a full table scan; do it as ADD CONSTRAINT NOT VALID then VALIDATE separately.
DROP COLUMN — instant; the column is hidden but the bytes stay until the next VACUUM FULL.

The general technique on Postgres:

-- 1. Add nullable column (instant).
ALTER TABLE users ADD COLUMN tier text;

-- 2. Backfill in batches (no long lock).
UPDATE users SET tier = 'free' WHERE tier IS NULL AND id BETWEEN 0 AND 10000;

-- 3. Add a CHECK constraint as NOT VALID (instant; doesn't scan the table).
ALTER TABLE users ADD CONSTRAINT users_tier_not_null CHECK (tier IS NOT NULL) NOT VALID;

-- 4. Validate the constraint in a separate transaction (scans, but doesn't lock writes).
ALTER TABLE users VALIDATE CONSTRAINT users_tier_not_null;

-- 5. (optional, later) Promote to a real NOT NULL column once you trust it.
ALTER TABLE users ALTER COLUMN tier SET NOT NULL;

MySQL: use online schema-change tools

MySQL's ALTER TABLE historically locked the table. The community built tools to do it online:

gh-ost (GitHub) — tails the binlog, copies rows into a shadow table, atomic rename at cutover. No triggers; foreign-key-aware.
pt-online-schema-change (Percona) — uses triggers on the original table to mirror writes into a shadow table. Older but widely supported.

A typical gh-ost invocation:

gh-ost \
  --user=root --password=... --host=db.local \
  --database=app \
  --table=orders \
  --alter='ADD COLUMN tier VARCHAR(16) NOT NULL DEFAULT "free"' \
  --max-load='Threads_running=25' \
  --critical-load='Threads_running=200' \
  --throttle-control-replicas='replica1.local,replica2.local' \
  --execute

The migration runs in the background, throttles when load is high, and atomically renames at the end. For multi-hour migrations on busy tables, this is the only sane approach.

Renames — never in place

Never ALTER TABLE ... RENAME COLUMN ... TO ... on a column that any service reads. The instant the schema deploy goes out, every service that hasn't deployed the new code starts erroring.

Always:

-- 1. Add new column.
ALTER TABLE users ADD COLUMN email_lower text;

-- 2. Backfill.
UPDATE users SET email_lower = LOWER(email) WHERE email_lower IS NULL;

-- 3. App: dual-write (new code writes both).
-- 4. App: switch reads to new column.
-- 5. App: stop writing to old.
-- 6. Schema: drop old column.

Type changes — don't

Changing a column's type (e.g. int to bigint) on a big table is rarely safe. Add a new column of the new type, dual-write, backfill, switch reads, drop the old. The expand-contract pattern works for type changes the same as for renames.

Foreign keys: be careful

Adding a foreign key to a big table can lock both tables for the duration of the validation. The Postgres trick:

-- 1. Add the constraint NOT VALID — only checks new rows.
ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- 2. Backfill: ensure every existing orders.user_id has a matching users.id.
SELECT o.* FROM orders o LEFT JOIN users u ON u.id = o.user_id WHERE u.id IS NULL;
-- (fix any orphans)

-- 3. Validate — scans both tables, but doesn't take an exclusive lock.
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

Adding indexes online

Postgres: CREATE INDEX CONCURRENTLY builds the index without blocking writes. Slower than the regular CREATE INDEX, but doesn't take an exclusive lock. Use this for any production table.

MySQL: ALGORITHM=INPLACE, LOCK=NONE for B-tree indexes; gh-ost / pt-osc for anything more invasive.

Blue/green for the schema

When expand-contract is too risky — e.g. multi-table refactors that need to be atomic — you can blue/green the database itself:

Stand up a parallel database with the new schema.
Sync data from old → new via logical replication or a CDC pipeline.
Cut over: stop writes briefly, drain the queue, switch the application connection.
Run new schema; old is the standby.
Decommission the old once you trust the new.

This is heavy, but it's the right answer for major schema overhauls. Tools: AWS DMS, Debezium, native logical replication.

Migration runners

Don't write migration scripts ad-hoc. Use a runner:

Flyway / Liquibase — JVM-flavoured but database-agnostic; versioned SQL files in source control.
Alembic (Python / SQLAlchemy) — Python migration runner with autogeneration.
Knex / Prisma migrate / TypeORM migrations — Node ecosystem.
Diesel migrations — Rust.
golang-migrate — Go.

The runner gives you: version tracking (which migrations have run), rollback metadata, repeatable execution.

Common pitfalls

Treating a migration runner as the cap on safety. Runners don't enforce expand-contract; they just execute SQL. The discipline is yours.

Forgetting to backfill in batches. A single UPDATE users SET tier = 'free' on a billion-row table holds row locks for hours. Batch by id range.

Skipping the dual-write window. "Backfill, then deploy code that uses the new column" — what about the writes that happened during the backfill? Always dual-write through the cutover.

Dropping the old column too early. Wait at least one full deploy cycle after the read-cutover before dropping. Roll-back during cutover should still find the old column populated.

No rollback plan. Every migration deploy needs an answer to "if this goes wrong, what's the SQL to undo it without losing data?" Often the answer is "nothing" because expand-contract makes each step roll-forward-able — but make that explicit.

Index bloat from forgotten unused indexes. Old indexes from prior schemas sit there costing write performance. Audit pg_stat_user_indexes (Postgres) or INFORMATION_SCHEMA.STATISTICS (MySQL) for unused ones.

A safe rollout checklist

For every migration:

Is each step independently deployable?
Is each step rollback-safe?
Does it use online tools (gh-ost / CREATE INDEX CONCURRENTLY) where the table is big?
Has the backfill been tested on a snapshot of production-sized data?
Is there a runbook documenting the deploy order across schema and app changes?
Has someone reviewed the migration who didn't write it?

Tick those for every change and most "the migration broke production" stories disappear.

Tools in the wild

5 tools

gh-ostfree tier
GitHub's binlog-tailing online schema-change for MySQL — production-tested at huge scale.
cli
pt-online-schema-changefree tier
Percona's trigger-based online schema-change tool; classic alternative to gh-ost.
cli
pgrollfree tier
Postgres-native expand-contract: maintains both schema versions simultaneously via views.
cli
Liquibase / Flywayfree tier
Migration runners with versioned migration files; integrate into CI.
library
Supabase / Hasura migrationsfree tier
Postgres-first migration tooling with branching.
library