Managed Databases

RDS vs Aurora vs DynamoDB vs Spanner — and what you actually buy.

A managed database is just someone else's on-call. You give up some control over the engine, the kernel, and the disk; in return, you stop getting paged at 3am when storage fails over. For most workloads, that trade is correct.

Analogy

Self-managing a database is owning a building's boiler. You're responsible for the burner, the pump, the pressure valve, the annual inspection, the spare parts cupboard, and the call to the boiler engineer at midnight when the heat goes out. A managed database is leasing a flat with the heating included. You give up the ability to fine-tune the burner; you also give up most of the reasons it would ever wake you up. Either is reasonable. Most people who insist on owning the boiler haven't actually had a heating failure at midnight in January.

The decision tree

The first axis is the shape of your data:

  1. Mostly relational, joins matter, transactions across tables → RDS / Aurora / Spanner
  2. Key-value or single-row lookups, sub-10ms p99 at any scale → DynamoDB / Bigtable
  3. Columnar analytics, terabyte aggregations, BI workloads → Redshift / BigQuery / Snowflake
  4. In-memory cache, sub-ms reads, ephemeral by design → ElastiCache / MemoryDB

The second axis is the scale and consistency requirements:

  • Single-region, single-AZ — RDS Single-AZ is fine for dev/test and for workloads that can tolerate downtime while the instance is recovered.
  • Single-region, HA — RDS Multi-AZ or Aurora.
  • Multi-region read replicas with eventual consistency — Aurora Global, DynamoDB Global Tables.
  • Global strong consistency with transactions — Spanner is the established choice. AWS's closest counterpart is the newer Aurora DSQL.
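The two axes above can be written down as a lookup, which makes the order of questions explicit. This is a toy encoding, not a real selection tool: the enum names and the `shortlist` function are hypothetical. To keep it short, the second axis is applied only to relational workloads; other shapes return their whole family from the first axis.

```python
from enum import Enum


class Shape(Enum):
    RELATIONAL = "relational"
    KEY_VALUE = "key-value"
    ANALYTICS = "analytics"
    CACHE = "cache"


class Scope(Enum):
    SINGLE_AZ = "single-az"
    SINGLE_REGION_HA = "single-region-ha"
    MULTI_REGION_EVENTUAL = "multi-region-eventual"
    GLOBAL_STRONG = "global-strong"


# First axis: the shape of the data picks the family of services.
FAMILIES = {
    Shape.RELATIONAL: ["RDS", "Aurora", "Spanner"],
    Shape.KEY_VALUE: ["DynamoDB", "Bigtable"],
    Shape.ANALYTICS: ["Redshift", "BigQuery", "Snowflake"],
    Shape.CACHE: ["ElastiCache", "MemoryDB"],
}

# Second axis, applied to relational workloads only in this sketch.
RELATIONAL_BY_SCOPE = {
    Scope.SINGLE_AZ: "RDS Single-AZ",
    Scope.SINGLE_REGION_HA: "RDS Multi-AZ or Aurora",
    Scope.MULTI_REGION_EVENTUAL: "Aurora Global",
    Scope.GLOBAL_STRONG: "Spanner",
}


def shortlist(shape: Shape, scope: Scope) -> list[str]:
    """Return candidate services for a workload by walking the two axes."""
    if shape is Shape.RELATIONAL:
        return [RELATIONAL_BY_SCOPE[scope]]
    # Non-relational shapes: return the whole family; scope is left to the reader.
    return list(FAMILIES[shape])


print(shortlist(Shape.RELATIONAL, Scope.SINGLE_REGION_HA))  # ['RDS Multi-AZ or Aurora']
```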

RDS vs Aurora

Both offer managed Postgres and MySQL. The architectural difference is the storage layer.

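RDS is conventional: each instance has its own EBS volume. In a Multi-AZ deployment, the standby has its own EBS volume, and the primary's writes are replicated to it synchronously at the storage level. Read replicas, by contrast, use the engine's native asynchronous replication. Failover means promoting the standby (60–120 seconds typical).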

Aurora decouples compute from storage. Every Aurora cluster has a single, shared, 6-way-replicated storage volume across 3 AZs. The primary writes to it; up to 15 replicas read directly from it. Failover is sub-30s because there's no replication catch-up — replicas already see the same volume.
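The 6-way layout is easier to reason about with the quorum sizes from AWS's published Aurora design: a write counts once 4 of the 6 copies acknowledge it, and 3 of 6 copies are enough to rebuild state. In normal operation Aurora reads from a single up-to-date copy; the read quorum matters mainly for recovery. The sketch below is a toy model of those counts only, not Aurora's storage protocol. The constant names and `still_available` are hypothetical.

```python
AZS = 3
COPIES_PER_AZ = 2
TOTAL_COPIES = AZS * COPIES_PER_AZ  # 6
WRITE_QUORUM = 4                    # acks needed before a write counts as durable
READ_QUORUM = 3                     # copies consulted when state must be rebuilt

# Quorum sanity rules: every read set overlaps every write set, so a read
# quorum always includes a copy of the latest acknowledged write; and any
# two write sets overlap, so two writes can't both succeed without seeing each other.
assert READ_QUORUM + WRITE_QUORUM > TOTAL_COPIES
assert 2 * WRITE_QUORUM > TOTAL_COPIES


def still_available(lost_copies: int) -> tuple[bool, bool]:
    """Return (can_write, can_read) after `lost_copies` of the 6 copies fail."""
    alive = TOTAL_COPIES - lost_copies
    return alive >= WRITE_QUORUM, alive >= READ_QUORUM


print(still_available(2))  # (True, True): a whole AZ is gone
print(still_available(3))  # (False, True): an AZ plus one more copy
```

This is why two copies per AZ matters. Losing an entire AZ still leaves a write quorum, and losing an AZ plus one more copy still leaves enough copies to repair from.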

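Feature             | RDS                                            | Aurora
--------------------|------------------------------------------------|--------------------
Engines             | Postgres, MySQL, MariaDB, Oracle, SQL Server   | Postgres, MySQL only
Read replicas       | 5–15, depending on engine                      | up to 15
Storage replication | per-instance EBS; sync standby, async replicas | shared volume, 6-way across 3 AZs
Failover            | 60–120 s                                       | < 30 s
Storage cost        | EBS volume pricing (e.g. gp3)                  | per-GB-month + per-I/O (Standard); no I/O charge on I/O-Optimized
Read scaling        | async replicas; lag grows under write load     | replicas read the same volume; lag typically well under 100 ms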

When to pick RDS: you want exact engine-version control, broad engine support, simpler pricing for low-throughput workloads. When to pick Aurora: you need fast failover, many read replicas, or you'll grow into Aurora Global / Aurora Serverless.

DynamoDB

The model is brutally simple:

  • Partition key (hash) — required.
  • Sort key (range) — optional.
  • Anything else is just an attribute on the item.

Every read is one of:

  • GetItem(pk, sk) — single-item lookup.
  • Query(pk, sk-condition) — items within one partition.
  • Scan — table scan, paginated. Don't, basically.

You design the schema around access patterns, not normalisation. The single-table design pattern (many entity types in one table, told apart by key prefixes such as CUSTOMER# or ORDER#) is normal and correct.
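To make the three read shapes and the single-table idea concrete, here is a toy in-memory stand-in. It is not boto3 and not DynamoDB's API. The `Table` class, its method names, and the `pk`/`sk` attribute names are hypothetical. `query` filters by sort-key prefix, in the spirit of a `begins_with` key condition.

```python
from typing import Optional


class Table:
    """Toy in-memory stand-in for a DynamoDB table keyed by (pk, sk)."""

    def __init__(self) -> None:
        # pk -> list of (sk, item), kept sorted by sk, like one partition.
        self._partitions: dict[str, list[tuple[str, dict]]] = {}

    def put_item(self, item: dict) -> None:
        pk, sk = item["pk"], item["sk"]
        rows = [r for r in self._partitions.get(pk, []) if r[0] != sk]  # overwrite same key
        rows.append((sk, item))
        rows.sort(key=lambda r: r[0])
        self._partitions[pk] = rows

    def get_item(self, pk: str, sk: str) -> Optional[dict]:
        """GetItem: exact (pk, sk) lookup."""
        for row_sk, item in self._partitions.get(pk, []):
            if row_sk == sk:
                return item
        return None

    def query(self, pk: str, sk_prefix: str = "") -> list[dict]:
        """Query: one partition, filtered by sort-key prefix."""
        return [item for row_sk, item in self._partitions.get(pk, []) if row_sk.startswith(sk_prefix)]

    def scan(self) -> list[dict]:
        """Scan: touches every item in every partition."""
        return [item for rows in self._partitions.values() for _, item in rows]


# One customer's profile and orders share a partition; sk prefixes separate entity types.
table = Table()
table.put_item({"pk": "CUSTOMER#42", "sk": "PROFILE", "name": "Ada"})
table.put_item({"pk": "CUSTOMER#42", "sk": "ORDER#2024-02-11", "total": 30})
table.put_item({"pk": "CUSTOMER#42", "sk": "ORDER#2024-01-05", "total": 12})

orders = table.query("CUSTOMER#42", sk_prefix="ORDER#")
print([o["sk"] for o in orders])  # ['ORDER#2024-01-05', 'ORDER#2024-02-11']
```

Putting a sortable date after the `ORDER#` prefix is the design choice that turns "this customer's orders, oldest first" into a single-partition Query rather than a Scan.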

Capacity modes

  • Provisioned — you declare RCU / WCU; you're billed for them whether you use them or not. Auto-scaling can flex.
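  • On-demand (PAY_PER_REQUEST) — billed per request, no provisioning. Several times more expensive per request than fully utilised provisioned capacity (roughly 7× at historical prices; AWS halved on-demand throughput prices in late 2024), but zero idle cost. The right default for unpredictable workloads.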
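On-demand's per-request premium turns the choice into one break-even number. If on-demand costs `r` times as much per request as provisioned capacity running at 100% utilisation, then provisioned capacity at average utilisation `u` effectively costs `1/u` times its ideal price. On-demand is therefore cheaper while `u < 1/r`. This is a back-of-the-envelope sketch with hypothetical function names. It ignores auto-scaling lag, reserved capacity, and storage costs. The ratios 7.0 and 3.5 are illustrative, not current list prices.

```python
def breakeven_utilisation(price_ratio: float) -> float:
    """Average utilisation below which on-demand is cheaper than provisioned.

    price_ratio is on-demand cost per request divided by the cost per request of
    provisioned capacity running at 100% utilisation. At utilisation u, each
    request served on provisioned capacity costs 1/u times the ideal price, so
    on-demand wins while price_ratio < 1/u, i.e. u < 1/price_ratio.
    """
    if price_ratio <= 1:
        raise ValueError("expected on-demand to cost more per request than fully used provisioned capacity")
    return 1.0 / price_ratio


def cheaper_mode(avg_utilisation: float, price_ratio: float) -> str:
    return "on-demand" if avg_utilisation < breakeven_utilisation(price_ratio) else "provisioned"


for ratio in (7.0, 3.5):
    print(ratio, round(breakeven_utilisation(ratio), 3))
# 7.0 0.143
# 3.5 0.286
```

At a 7× premium, a table whose provisioned capacity sits more than about 86% idle on average is cheaper on on-demand. A smaller premium pushes the break-even point higher.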

Global tables and indexes

  • GSI (Global Secondary Index) — alternate access patterns with their own pk/sk. Eventual consistency from base table.
  • LSI (Local Secondary Index) — alternate sort key, same pk. Created at table creation; rarely worth the rigidity.
  • Global tables — multi-region active-active with last-writer-wins. Sub-second cross-region replication.
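Last-writer-wins is simple enough to state in a few lines, and seeing it written down makes its cost obvious: the losing write is dropped without an error. The sketch below is a generic LWW merge, not DynamoDB's internal implementation. The `Version` type and `lww_merge` are hypothetical, and breaking exact timestamp ties by region name is an assumption made so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # wall-clock time of the write
    region: str       # assumed tie-breaker so every replica picks the same winner


def lww_merge(a: Version, b: Version) -> Version:
    """Last-writer-wins: keep the later write, breaking exact ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


# Concurrent updates to the same item in two regions.
us = Version("shipped", timestamp=100.0, region="us-east-1")
eu = Version("cancelled", timestamp=100.2, region="eu-west-1")

# Each region applies the other's write; merge order doesn't matter, so they converge.
assert lww_merge(us, eu) == lww_merge(eu, us) == eu
```

Both regions converge on "cancelled", and the "shipped" update disappears. If losing a concurrent write is unacceptable for an item, it needs a single writing region or a different design.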

When to pick DynamoDB: high-traffic key-value workloads, unpredictable scale, and IAM-integrated auth. Don't pick it for join-heavy relational queries.

Spanner — globally consistent SQL

Spanner is unique. Most "global" databases (Aurora Global, DynamoDB Global Tables) replicate eventually. Spanner replicates synchronously, with global ordering, using TrueTime — a synchronised-clock service that bounds the uncertainty in "now" to a few ms. The result: you get SQL transactions with external consistency (any observer sees ordered, linearised commits) across regions.

The trade-off: write latency is the inter-region round-trip plus TrueTime overhead. Single-region writes are 5–10ms typical; multi-region writes are 100ms+.
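The core TrueTime trick, described in Google's Spanner paper as "commit wait", fits in a short simulation. A transaction takes the latest possible "now" as its commit timestamp, then waits until the earliest possible "now" has passed it (about twice the uncertainty bound) before making the commit visible. This is a single-process sketch, not Spanner's API. `FakeTrueTime`, `epsilon_s`, and `commit_with_wait` are hypothetical names, and the local monotonic clock stands in for true time. In real Spanner, commit wait overlaps with Paxos replication rather than simply adding to it.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class FakeTrueTime:
    """Simulated TrueTime: now() is an interval assumed to contain true time."""

    def __init__(self, epsilon_s: float) -> None:
        self.epsilon_s = epsilon_s  # assumed clock-uncertainty bound, in seconds

    def now(self) -> TTInterval:
        t = time.monotonic()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)


def commit_with_wait(tt: FakeTrueTime) -> tuple[float, float]:
    """Choose a commit timestamp, then wait until it is definitely in the past.

    Returns (commit_timestamp, seconds_spent_waiting).
    """
    start = time.monotonic()
    ts = tt.now().latest            # no correct clock can be ahead of this yet
    while tt.now().earliest <= ts:  # commit wait: about 2 * epsilon
        time.sleep(tt.epsilon_s / 10)
    # Only now may the commit become visible: any transaction that starts
    # later will pick a timestamp greater than ts.
    return ts, time.monotonic() - start


tt = FakeTrueTime(epsilon_s=0.004)  # 4 ms, in the "few ms" range described above
ts, waited = commit_with_wait(tt)
print(f"commit wait: {waited * 1000:.1f} ms")
```

Shrinking the clock uncertainty shrinks the wait directly, which is why Spanner invests in tightly synchronised clocks.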

Spanner is the right choice when: you genuinely need globally-consistent SQL transactions, the cost is justified, and you're already on GCP.

Read replicas — the always-relevant feature

Three reasons read replicas exist:

  1. Spread read load. A primary that's CPU-bound on reads benefits hugely from offloading.
  2. Reporting / analytics without hitting the OLTP primary.
  3. Disaster recovery — a replica in a different region is your warm DR.

Things to know:

  • Replication lag is real. Most workloads tolerate sub-second; some don't. Critical reads should go to the primary.
  • Failover promotion is one-way. Once you promote the replica, it's a new primary; the old one needs to be re-bootstrapped.
  • Pin reads explicitly. Most ORMs/connection libraries can route by query type. Don't rely on the database doing it.
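Pinning reads explicitly usually comes down to one routing decision made in application code. The sketch below shows that decision in isolation. `ReadWriteRouter`, its `target` method, and the hostnames are hypothetical, not any particular ORM's API. A real router would also handle connection pooling and replicas that have fallen too far behind.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ReadWriteRouter:
    """Route writes and freshness-critical reads to the primary; spread other reads."""

    primary: str
    replicas: list[str]
    rng: random.Random = field(default_factory=random.Random)

    def target(self, is_write: bool, needs_fresh_read: bool = False) -> str:
        # Writes must hit the primary; so must reads that can't tolerate replica lag.
        if is_write or needs_fresh_read or not self.replicas:
            return self.primary
        # Everything else can be served by any replica.
        return self.rng.choice(self.replicas)


router = ReadWriteRouter(
    primary="db-primary.internal",  # hypothetical hostnames
    replicas=["db-replica-1.internal", "db-replica-2.internal"],
)
router.target(is_write=True)                          # -> primary
router.target(is_write=False, needs_fresh_read=True)  # e.g. read-your-own-write -> primary
router.target(is_write=False)                         # -> one of the replicas
```

The caller states whether a read needs to be fresh, so the database never has to guess.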

Multi-AZ vs Multi-region

Two different DR postures:

  • Multi-AZ — replicas in different AZs of the same region. Survives an AZ failure. Latency-cheap (a few ms cross-AZ). Default for production.
  • Multi-region — replicas in different regions. Survives a region failure. 50–200ms cross-region latency. Expensive.
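A quick calculation shows why the latency gap matters for writes. Assume each commit on one connection waits for a single synchronous round trip to a replica, and ignore disk, CPU, and jitter. Then a serial writer can do at most 1000 / RTT(ms) commits per second. The 2 ms and 100 ms figures below are picked from the ranges above, and the function name is hypothetical.

```python
def max_serial_commits_per_second(round_trip_ms: float) -> float:
    """Upper bound for one connection whose every commit waits one synchronous
    replica round trip (ignores disk, CPU, and network jitter)."""
    return 1000.0 / round_trip_ms


for label, rtt_ms in [("cross-AZ", 2.0), ("cross-region", 100.0)]:
    print(label, max_serial_commits_per_second(rtt_ms))
# cross-AZ 500.0
# cross-region 10.0
```

A fiftyfold drop per connection is why cross-region replication is usually asynchronous, which trades this latency for replication lag and a window of possible data loss on failover.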

Most production systems run Multi-AZ from day one and add Multi-region replication only when:

  • The compliance / business-continuity requirement explicitly demands it.
  • Cross-region read traffic latency justifies replicas closer to users.
  • You've already nailed Multi-AZ and your runbooks include cross-region failover drills.

The shortcut "multi-region everything" is almost always premature.

When the managed database is wrong

Three signals:

  • Latency-sensitive hot loops. A self-managed engine on EC2 / GKE with a hand-tuned kernel, io_uring, and local NVMe can beat RDS by ~20% for the right shape of workload.
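  • Engine version pinning. RDS lets you pick "MySQL 8.0", but the minor version rolls forward: you can disable automatic minor-version upgrades, yet AWS eventually deprecates old minor versions and upgrades instances off them. Some workloads (regulated finance, payment processors) want byte-identical engine versions across years.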
  • Custom extensions. RDS Postgres only allows extensions that AWS supports. If you need pg_cron configured in a particular way, or timescaledb at a specific minor version, self-managed sometimes wins.

For 95% of workloads, the managed service is the right call. Only the remaining 5% justify taking on the 24/7 on-call yourself.

A diagnostic checklist

When considering a managed database:

  • What's the access pattern? (KV, relational, analytical, full-text?)
  • What's the traffic shape? (Steady, spiky, scale-to-zero?)
  • What's the latency budget? (Sub-ms? Sub-10ms? Sub-100ms?)
  • What's the consistency requirement? (Eventual, read-after-write, linearisable?)
  • What's the data size? (GB, TB, PB?)
  • Single-region or multi-region?

Answer those six and the choice is usually obvious. Skip them and you'll end up with a relational schema in DynamoDB or a key-value workload chewing through Aurora I/O.
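Recording the six answers in a structured form makes them reviewable, and the two mismatches named above can be checked mechanically. This is a hypothetical sketch: `WorkloadProfile`, its field names, and `mismatches` are illustrative. It encodes only those two red flags, so an empty result is not an endorsement of the choice.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Answers to the six checklist questions (field names are illustrative)."""

    access_pattern: str       # "kv", "relational", "analytical", "full-text"
    traffic_shape: str        # "steady", "spiky", "scale-to-zero"
    latency_budget_ms: float  # e.g. 1, 10, 100
    consistency: str          # "eventual", "read-after-write", "linearisable"
    data_size_gb: float
    multi_region: bool


def mismatches(profile: WorkloadProfile, chosen: str) -> list[str]:
    """Flag the two classic mismatches; an empty list is not an endorsement."""
    flags = []
    if chosen == "DynamoDB" and profile.access_pattern == "relational":
        flags.append("relational schema in DynamoDB: joins move into application code")
    if chosen == "Aurora" and profile.access_pattern == "kv":
        flags.append("key-value workload chewing through Aurora I/O")
    return flags


orders_service = WorkloadProfile("relational", "steady", 10.0, "read-after-write", 50.0, False)
print(mismatches(orders_service, "DynamoDB"))
```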

Tools in the wild

  • Managed Postgres/MySQL/MariaDB/Oracle/SQL Server. Multi-AZ replication.
  • Postgres/MySQL with shared storage layer; 6-way replication, fast failover.
  • Serverless NoSQL key-value + document store; sub-10ms p99 at any scale.
  • Globally-distributed strongly-consistent SQL. TrueTime-backed transactions.
  • Managed Redis / Memcached; MemoryDB is durable, multi-AZ Redis.
  • Columnar MPP data warehouse for petabyte-scale analytics.