databases · level 7

NoSQL Tradeoffs

Picking by access pattern, not by hype.

200 XP

NoSQL Tradeoffs

NoSQL isn't a database. It's a category of "we built something that isn't relational because relational was wrong for our access pattern." The skill is identifying which shape your access pattern wants — and not picking by hype.

Analogy

A relational database is the local council records office: every fact filed under a strict schema, cross-referenced, joined on demand, audited. It's the right tool when you need to look up everyone's birthday by month for tax purposes. But if your job is to phone the next person on a long list one at a time (key-value), or to find every "friend of a friend of a friend" in a social network (graph), or to ingest a million temperature readings per second from sensors (time-series), the records office is not your tool. NoSQL means "I picked a different building".

The five families

Key-value

The simplest model: key → value. The value is opaque (a string, a JSON blob, whatever your app understands). No queries beyond get(key), put(key, value), delete(key).

Examples: DynamoDB, Redis, etcd, Riak.

When it fits:

Single-key lookups: user-by-id, session-by-token, page-by-url.
Hot caches.
Massive scale where joins are out of scope by design.

When it's wrong: anything that needs to query by anything other than the key. The day you find yourself "scanning" the table to filter, you've outgrown KV.

Document

Each record is a JSON-shaped document. The DB indexes individual fields and supports secondary queries by content.

Examples: MongoDB, Couchbase, Firestore, DocumentDB (AWS).

When it fits:

Workloads where the schema is heterogeneous: optional fields, deeply nested JSON, schema-on-write doesn't fit.
Mostly-by-id reads but occasional secondary queries by content.
Rapid early-stage product where schema is fluid.

When it's wrong: heavy-join workloads, transactions across many documents, strong relational integrity requirements. The "schema-less" benefit becomes a liability when your app has 50 services all reading and writing the same documents.

Realistically, if all you need is "JSON in a column with a content index", Postgres jsonb + GIN index does this without adding a second database.

Graph

Records are nodes; edges connect them; queries traverse paths.

Examples: Neo4j, Amazon Neptune, TigerGraph, JanusGraph.

When it fits:

Relationships dominate: friends-of-friends, recommendation graphs, fraud rings.
Variable-length traversals: "shortest path", "all reachable nodes", "common ancestors".
Workloads where SQL would join the same table 5+ times.

When it's wrong: tabular data with one or two joins. Postgres handles 1–2-hop queries trivially. Reach for a graph DB when the depth is unbounded or the relationship-to-data ratio is high.

Wide-column

Sparse rows with potentially huge numbers of columns; partitioned by row key, sorted within partition by column key.

Examples: Cassandra, ScyllaDB, HBase, Bigtable.

When it fits:

Massive write throughput at scale (Discord, Netflix viewing history).
Tunable consistency knobs (Cassandra QUORUM / ALL / ONE).
Time-series-shaped writes where each row is a wide partition over time.

When it's wrong: mid-scale workloads where the operational complexity isn't justified. Cassandra is hard to run correctly; if you're not at the scale where it pays off, choose a simpler tool.

Time-series

Append-only data keyed by time, with built-in window aggregations, downsampling, and retention.

Examples: TimescaleDB (Postgres extension), InfluxDB, AWS Timestream, VictoriaMetrics.

When it fits:

Sensor / metrics / event streams: 1k–1M points/sec, mostly recent, downsample over time.
Queries are window aggregations over time ranges.
Old data tiers down to cheap storage; retention policies.

When it's wrong: random-access reads to historical data, transactional updates. Time-series DBs are tuned for sequential append + bucketed range scans.

Schema-less is a lie

The "no schema" pitch sounds great until your app gets to ~100 fields per document and three services need to agree on what they all mean. The schema doesn't disappear; it moves:

Relational: schema is in CREATE TABLE statements, enforced by the DB.
NoSQL: schema is in your application code (validators, ORMs, types) and reader bugs.

Both are valid. The relational version is harder to evolve quickly; the NoSQL version is harder to enforce consistently. Mature NoSQL teams add some kind of schema validation layer (JSON Schema, Avro, Protobuf) anyway because "schema in head" doesn't scale past a few engineers.

Picking by access pattern (not by hype)

The discipline:

List the access patterns before picking a database. "Read user by id", "read all orders for user since timestamp X", "search products by full-text", etc.
Estimate the volumes: writes/sec, reads/sec, p99 latency budget.
Sketch the data model in each candidate:
- Postgres: tables and indexes.
- DynamoDB: PK/SK structure.
- Mongo: documents and indexes.
Reject candidates that can't answer all the patterns at the required latency.
Pick the simplest of what's left.

Postgres covers an enormous range of access patterns surprisingly well: jsonb for documents, GIN for content search, pgvector for embeddings, pg_trgm for fuzzy match, foreign data wrappers for joining external systems. The reflex to "go NoSQL" is often misplaced — the question is "does Postgres also handle this with one extension".

When polyglot is right

Sometimes you actually do need multiple databases. The shape:

Postgres for the transactional core (orders, users, billing).
Redis for hot caches and session storage.
Elasticsearch / OpenSearch for full-text search and log analytics.
DynamoDB for one specific high-traffic KV access pattern.

The cost of polyglot: every database is its own ops, monitoring, backup, security review, on-call expertise. Below a certain scale, "Postgres for everything" plus careful indexing beats a 4-database architecture every time.

CAP-ish considerations

The CAP-theorem framing oversimplifies, but its core concern is real: every distributed database trades consistency, availability, and partition tolerance.

DynamoDB / Cassandra: Available, partition-tolerant, eventual consistency by default. (You can pay for stronger consistency per-call.)
Postgres replicas: Consistent on the primary, available on replicas, with replication lag.
Spanner: Globally consistent, partition-tolerant, with synchronous replication. The trade is write latency.

The choice isn't theoretical — it shows up in your access patterns. "Read your own writes" requires either reading from the primary or having a cache that bridges replication lag. Test it explicitly; don't trust the marketing pages.

Common bugs

Treating a document store like Postgres. Cross-document transactions are limited, joins are awkward, schema drift is a constant tax. If your data is fundamentally relational, use a relational DB.

Treating Postgres like a key-value store. Worth the comparison: Postgres at modest scale serves single-key reads in low single-digit ms with a B-tree. You don't need DynamoDB for "find user by id" until you genuinely have the scale.

Hot partitions in DynamoDB / Cassandra. A poorly-chosen partition key concentrates traffic on one node. A high-traffic user's id, an "active orders" status — these become hot. Choose partition keys that distribute evenly under your real traffic.

ScansInsteadOfQueries. DynamoDB Scan is the equivalent of Postgres Seq Scan — full table read. Almost always a sign you've designed the schema wrong; rework the access pattern + GSI before scanning.

Joining in application code. "I'll just fetch from N tables and combine in app" works at small scale. At any traffic, the network overhead of N round-trips kills you. Use the database's native join, or denormalise.

A reasonable defaults table

Workload	First reach
OLTP, relational, modest scale	Postgres
Massive single-key lookups, AWS-native	DynamoDB
Hot cache layer	Redis / ElastiCache
Full-text search	Postgres + GIN, then OpenSearch
Time-series metrics	TimescaleDB / Prometheus
Graph traversals	Postgres + recursive CTEs, then Neo4j
Vector / embedding search	Postgres + pgvector, then Pinecone / Qdrant
Heavy analytical aggregations	Redshift / BigQuery / ClickHouse

The "Postgres first, specialised later" pattern works for ~80% of teams. Specialise only when the access pattern provably outgrows it.

Tools in the wild

6 tools

MongoDBfree tier
Document store; secondary indexes; aggregation framework.
library
DynamoDB
Serverless key-value with single-table-design idioms; sub-10ms p99.
service
Redisfree tier
In-memory KV + structures (lists, sorted sets, streams); also a great cache.
library
Apache Cassandra / ScyllaDBfree tier
Wide-column distributed DB; tunable consistency; massive horizontal scale.
library
Neo4jfree tier
Property-graph DB with Cypher query language.
library
TimescaleDB / InfluxDBfree tier
Time-series databases for append-only metric / event workloads.
library