Multi-Region

Active-active vs pilot-light vs cold-DR — and the RTO/RPO trade.

The hardest mode of cloud operation. Most architectures that say "multi-region" on the org chart actually mean "we have replication configured but have never failed over". The gap between aspirational and exercised is where outages live.

Analogy

A multi-region setup is like having a second restaurant in another city; the postures differ in how ready it is to open. Active-active is keeping both open with the same menu and the same staff: expensive, but if one closes, customers walk to the other and barely notice. Active-passive is a second restaurant that is staffed and stocked but opens only when the main one closes: minutes of "we're opening, please bear with us". Pilot light is having a chef and the freezer running but no front-of-house, so opening is hours of work. Cold DR is a building with a sign and a key under the mat: you have to call the staff in, restock, and reopen from scratch. Each one is a defensible choice. Pretending you have the first when you actually have the third is how disasters get worse.

RTO and RPO — the two numbers

Every DR conversation should start with these two numbers. Both should be set by the business, not by engineering.

  • RTO (Recovery Time Objective) — "How long can the system be down?" Drives the recovery posture (active-active vs cold-DR).
  • RPO (Recovery Point Objective) — "How much data can we afford to lose?" Drives the replication strategy (sync vs async, frequency).

Examples:

  • Online banking: RTO 5 min, RPO 0. Multi-region active-active or near-synchronous replication.
  • E-commerce checkout: RTO 30 min, RPO 5 min. Active-passive with sub-5-min replication lag.
  • Internal HR tool: RTO 4 hours, RPO 1 hour. Pilot light or even cold DR.
  • Marketing blog: RTO 1 day, RPO 1 day. Backup to S3, restore from snapshot.

The bigger your RTO/RPO budgets, the cheaper your DR. As a rough rule of thumb, every halving of either number doubles the cost.
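The two numbers only mean something once they are compared against measurements. A minimal sketch of that comparison follows, assuming the recovery time and data loss come from a drill; the `Objectives` class, the `meets_objectives` function, and the measured values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    """Business-set targets for one workload."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, expressed as time


def meets_objectives(target: Objectives,
                     measured_recovery: timedelta,
                     measured_data_loss: timedelta) -> dict[str, bool]:
    """Compare numbers measured in a drill against the targets."""
    return {
        "rto_ok": measured_recovery <= target.rto,
        "rpo_ok": measured_data_loss <= target.rpo,
    }


# E-commerce checkout example from the list above: RTO 30 min, RPO 5 min.
checkout = Objectives(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
result = meets_objectives(
    checkout,
    measured_recovery=timedelta(minutes=12),  # hypothetical drill result
    measured_data_loss=timedelta(minutes=7),  # hypothetical lag at cutover
)
print(result)  # {'rto_ok': True, 'rpo_ok': False}
```

In this made-up drill the system came back fast enough but lost more data than the business allowed, which points at the replication strategy rather than the recovery posture.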

The four DR postures

Ordered from most expensive to cheapest, which is also fastest to slowest to recover:

1. Active-active

Both regions serve live traffic. Writes go to whichever region the user is closer to. Replication is bidirectional.

  • RTO: seconds (DNS / Global Accelerator failover).
  • RPO: 0 to seconds (depending on replication tech).
  • Cost: 2× the steady-state, plus cross-region data transfer.

The catch: conflict resolution. For most stateful systems, two simultaneous writes to the same record from different regions don't merge cleanly. Active-active works for systems built for it: DynamoDB Global Tables (last-writer-wins), Spanner (linearised globally), Cassandra. It's brittle for general SQL.
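To make the conflict problem concrete, here is a toy last-writer-wins merge. It is a sketch of the general technique, not DynamoDB's actual implementation; the `Version` type, the region-name tie-break, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # wall-clock write time recorded by the writing region
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp.

    Ties are broken by region name (an arbitrary but deterministic rule)
    so that both regions converge on the same winner.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions update the same order record 330 ms apart.
east = Version("shipping=express", 1_700_000_000_120, "us-east-1")
west = Version("address=new", 1_700_000_000_450, "us-west-2")

winner = last_writer_wins(east, west)
print(winner.value)  # address=new
```

The us-east-1 write is discarded without any error being raised. That is acceptable for data where the latest value is all that matters, and it is exactly why the same approach is brittle for general SQL workloads where both updates were meant to apply.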

2. Active-passive (warm standby)

One region serves all traffic. The standby region runs the same software, replicas are warm, and the load balancer fails over on health-check loss.

  • RTO: 1–5 minutes typical.
  • RPO: replication lag (effectively zero for sync, seconds to under a minute for async).
  • Cost: ~1.5× steady-state.

The most common production posture. Aurora Global Database, Route 53 failover, ElastiCache Global Datastore — all designed for this shape.

3. Pilot light

Only the data tier runs in the secondary region. Compute is scaled to zero or minimal. On failover, IaC scales the rest of the stack up.

  • RTO: tens of minutes to a few hours (you're spinning up infrastructure).
  • RPO: replication lag (the data tier replicates continuously, so typically seconds).
  • Cost: data tier + minimal compute. Maybe 1.1× steady-state.

A reasonable mid-tier choice for systems where minutes-to-hours of downtime is acceptable.

4. Cold DR (backup-and-restore)

Snapshots / backups are replicated cross-region. No infrastructure runs in the secondary region by default. Recovery is provisioning everything from IaC + restoring data from snapshots.

  • RTO: hours to a day.
  • RPO: backup frequency (typically 1 hour to 24 hours).
  • Cost: cross-region S3 storage + backup APIs. Marginal.

Acceptable for non-critical workloads where the business can tolerate a long outage.

A decision framework

What's the RTO?
├── seconds-to-minutes → active-active or active-passive
│       └── stateful workload that doesn't conflict-merge?
│           ├── yes → active-passive
│           └── no  → active-active (DynamoDB GT, Spanner, etc.)
├── tens of minutes  → pilot light
└── hours+           → cold DR / backup-restore
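
The tree above can be sketched as a small function. The numeric cut-offs (30 minutes and 1 hour) are assumptions chosen to make "seconds-to-minutes", "tens of minutes", and "hours+" concrete; `choose_posture` and its parameter names are hypothetical.

```python
from datetime import timedelta


def choose_posture(rto: timedelta,
                   conflict_safe_data_model: bool,
                   fast_cutoff: timedelta = timedelta(minutes=30),
                   slow_cutoff: timedelta = timedelta(hours=1)) -> str:
    """Map an RTO and a data-model property to one of the four postures."""
    if rto < fast_cutoff:
        # Active-active only makes sense if concurrent writes in two
        # regions can be merged; otherwise fall back to a warm standby.
        return "active-active" if conflict_safe_data_model else "active-passive"
    if rto < slow_cutoff:
        return "pilot light"
    return "cold DR"


print(choose_posture(timedelta(minutes=5), conflict_safe_data_model=True))    # active-active
print(choose_posture(timedelta(minutes=5), conflict_safe_data_model=False))   # active-passive
print(choose_posture(timedelta(minutes=45), conflict_safe_data_model=False))  # pilot light
print(choose_posture(timedelta(hours=4), conflict_safe_data_model=False))     # cold DR
```

Real decisions also weigh cost and RPO, so treat the output as a starting point for the conversation rather than an answer.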

The "multi-region" trap on day 1

Most teams add cross-region replication early because "we should have multi-region from the start". The result:

  • The replication is configured but never exercised.
  • The failover runbook says "promote replica" but no one knows the exact CLI invocations under stress.
  • The IaC has dev / staging / prod for one region, but the DR region has drift (manual changes, untested overrides).
  • Cost is 1.5–2× without the actual benefit.

The fix isn't "don't do multi-region". The fix is: chaos-test the failover regularly. Once a quarter, declare a region "down" and exercise the runbook. If the runbook doesn't survive the drill, the multi-region is theatre, not resilience.
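
One way to stop "promote replica" from being a vague runbook line is to script the step and time it during each drill. The sketch below assumes a plain RDS cross-region read replica (Aurora Global Database uses different API calls) and a boto3 RDS client passed in as `rds`; the function name `promote_replica` is hypothetical.

```python
import time


def promote_replica(rds, replica_id: str) -> float:
    """Promote a cross-region RDS read replica; return seconds elapsed.

    `rds` is expected to be a boto3 RDS client created in the DR region.
    """
    started = time.monotonic()
    # Turn the read replica into a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Block until the instance reports "available". This waiter may return
    # early if the status has not yet changed after the promote call, so a
    # real runbook should also verify the instance is no longer a replica.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
    return time.monotonic() - started
```

In a drill this would be called with something like `boto3.client("rds", region_name="us-west-2")` and the replica's identifier, and the returned duration recorded against the RTO. Passing the client in also lets the step be rehearsed against a stub without touching production.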

A pragmatic order:

  1. Single-region with rock-solid Multi-AZ. Most outages are AZ-level, and AZ failover is well-trodden.
  2. Cross-region snapshots / backups. Cheap insurance; it covers the "region just gone" case poorly, but it does cover it.
  3. Pilot-light secondary when business explicitly says "RTO must be under 1h".
  4. Active-passive when RTO drops below 30 minutes.
  5. Active-active only when (a) the data model supports it and (b) RTO requirements demand it.

Skipping step 1 to claim active-active is the most common mistake.

What actually replicates cross-region

AWS service            | Cross-region replication mechanism                  | Lag
RDS                    | Cross-region read replica (async)                   | seconds-to-minutes
Aurora Global          | Storage-level async replication                     | <1s
DynamoDB Global Tables | Active-active multi-master                          | <1s
S3                     | Cross-Region Replication (CRR)                      | seconds
EFS                    | Replication for filesystems                         | <1 hour
EBS                    | Snapshots with cross-region copy                    | per snapshot
Route 53               | Global by definition                                | n/a
Lambda code            | Deploy via IaC to each region                       | per deploy
ECR images             | Cross-region replication (push to one, replicates)  | seconds

Every cross-region service has its own knobs. There's no single "multi-region" toggle.

Cost realities

Cross-region data transfer is not free. Typical AWS pricing:

  • Inter-region transfer (US east ↔ US west): ~$0.02/GB.
  • Database replication traffic: ~$0.02/GB.
  • S3 Cross-Region Replication: replication storage + per-object charges.
  • Idle infrastructure in the standby region: ~50% of full active cost for warm standby.

At ~$0.02/GB, replicating 1 TB/day costs about $600/month (1,000 GB × $0.02 × 30 days). At 100 TB/day it is about $60K/month. Run the numbers before assuming "we'll just replicate everything".
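
The arithmetic is simple enough to keep in a script next to the capacity plan. This sketch uses the ~$0.02/GB figure quoted above and assumes a 30-day month and 1 TB = 1,000 GB; actual rates vary by region pair, so treat the default as a placeholder.

```python
def monthly_transfer_cost(gb_per_day: float,
                          usd_per_gb: float = 0.02,
                          days_per_month: int = 30) -> float:
    """Estimate monthly cross-region transfer cost in USD."""
    return gb_per_day * usd_per_gb * days_per_month


for tb_per_day in (1, 10, 100):
    cost = monthly_transfer_cost(tb_per_day * 1000)  # 1 TB taken as 1000 GB
    print(f"{tb_per_day:>3} TB/day -> ${cost:,.0f}/month")
# Prints $600, $6,000 and $60,000 per month respectively.
```

This covers transfer only. Replica storage, per-object replication charges, and idle standby compute come on top.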

DNS failover details

The mechanism most teams use:

  1. Route 53 health checks the primary endpoint every 30s.
  2. After N consecutive failures, the health check goes "unhealthy".
  3. Route 53 starts returning the secondary's address.
  4. Clients with cached DNS from the primary continue hitting it for TTL seconds — typically 60s for DR records.
  5. New DNS lookups go to the secondary.

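Real-world failover time is roughly the health-check detection window (check interval × N failures) plus the DNS TTL plus client retries, which often adds up to at least 2–5 minutes. AWS Global Accelerator, which uses anycast IPs rather than DNS changes, is faster (sub-30s) but more expensive.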
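
A back-of-the-envelope estimate of the DNS path can be written down directly. The sketch below assumes a failure threshold of 3 consecutive checks for the "N" in step 2 and ignores client retries and resolvers that cache beyond the TTL, so real numbers will be higher; the function name is hypothetical.

```python
def dns_failover_seconds(check_interval_s: int = 30,
                         failure_threshold: int = 3,
                         dns_ttl_s: int = 60) -> int:
    """Detection window plus the time cached DNS answers stay valid."""
    detection_window = check_interval_s * failure_threshold
    return detection_window + dns_ttl_s


print(dns_failover_seconds())                     # 150
print(dns_failover_seconds(check_interval_s=10))  # 90
```

With the defaults that is 2.5 minutes before the last well-behaved client moves over, which is why an RTO of "seconds" is hard to reach with DNS failover alone.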

Common pitfalls

Forgetting that the application has region awareness baked in. Hardcoded region names in env vars, IAM ARNs scoped to one region, S3 bucket names with the region in them. Comb these out before claiming multi-region.
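
A quick scan of configuration text can surface the most obvious cases. The sketch below is a heuristic under stated assumptions: the regular expression approximates AWS region naming and may miss or over-match some strings, and the sample env file, bucket name, and queue name are invented for illustration.

```python
import re

# Heuristic match for region identifiers such as us-east-1 or ap-southeast-2.
REGION_RE = re.compile(r"\b(?:us|eu|ap|sa|ca|me|af|il|mx|cn)-(?:gov-)?[a-z]+-\d\b")


def find_hardcoded_regions(text: str) -> list[tuple[int, str]]:
    """Return (line_number, region) pairs for every region-like string."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in REGION_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = """\
AWS_REGION=us-east-1
QUEUE_ARN=arn:aws:sqs:us-east-1:123456789012:orders
ASSETS_BUCKET=acme-assets-eu-west-1
LOG_LEVEL=info
"""
print(find_hardcoded_regions(sample))
# [(1, 'us-east-1'), (2, 'us-east-1'), (3, 'eu-west-1')]
```

A hit is not automatically a bug, since a single region variable that is set per deployment is fine. ARNs and bucket names that embed a region are the ones that tend to break on failover.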

Stale runbooks. The failover steps were correct on day one. Two years later, the IaC has changed, the team has rotated, and no one has re-tested. Schedule quarterly drills.

Cross-region transactions. Almost no managed database supports them. If you're trying to write to two regions transactionally, you're building Spanner — and you're probably wrong.

Not testing data restore from backups. Restoring from a snapshot you've never tested is a cliff. Restoring from one you haven't tested in the new region is a worse cliff. Run a quarterly restore drill into a sandbox account.

Bidirectional replication that goes asymmetric. Aurora Global is one-way (primary writes, replicas read). DynamoDB Global Tables is multi-master. Mixing the two semantics in one architecture is asking for split-brain.

A reasonable starting posture

For a company at modest scale:

  1. Multi-AZ everything within a single region. RTO seconds for AZ-level events.
  2. Aurora / RDS automated snapshots with cross-region copy enabled. RPO 1h, RTO 4h for region-level events.
  3. S3 + Cross-Region Replication for everything that doesn't live in a database.
  4. IaC reproducible per-region so you can terraform apply against the secondary and stand the rest of the stack up cold.
  5. Quarterly DR drill that actually fails over for an hour.

That posture gives you defensible compliance ticks, sensible RTO/RPO numbers, and — most importantly — a runbook that someone has actually executed under pressure. Build up from there as the business case justifies it.
