system-design · level 9

Capstone — URL Shortener

Architect the classic 100M-URL system, layer by layer.

200 XP

Capstone — Designing a URL Shortener

This is the canonical system-design interview problem because it touches everything. Load balancers, caches, sharded data stores, idempotency keys, rate limits, async event streams, capacity math — every layer of the stack lights up. By the end of this lesson the diagram should fall out of your head whole, with each box justified by a number you can compute on the spot.

We'll build it the way a senior engineer actually thinks through it: numbers first, write path, read path, then everything that hangs off the side.

The product

POST /shorten { "url": "https://example.com/very/long/path?q=42" } → { "short": "https://sho.rt/aB3xY9" }.

GET https://sho.rt/aB3xY9 → 301 Location: https://example.com/very/long/path?q=42.

That's it. Behind that 6-character path is a system that needs to scale to billions of redirects, never lose a mapping, and resolve a click in less time than it takes the user's browser to render.

Capacity math

Always start here. Numbers turn vague handwaves into concrete trade-offs.

100 million URLs total (the size of every interesting "what if Bitly?" estimate).
100:1 read-to-write ratio — every URL is shorted once and clicked many times.
Reads spread roughly evenly across 24 hours (peak ≈ 3× average).

Read QPS:

100M × 100 reads = 10B reads/day
10B / 86400 s = 115,740 reads/sec average
× 3 peak factor = ~350K reads/sec at peak

Write QPS:

100M URLs / 86400 / 30 days assumed-create-window ≈ 38 writes/sec average
× 5 peak factor = ~200 writes/sec at peak

Reads are 1000× hotter than writes. Every architecture decision should optimise the read path first.

Latency budget: p99 < 50ms. Most of that is network — your stack has maybe 10ms to do its work.

Storage: each row ≈ { short_code: 8B, long_url: ~80B avg, owner_id: 8B, created_at: 8B } ≈ 200B. 100M rows × 200B = ~20GB. Trivial. The KV layer fits on a single node — until you remember the read QPS, at which point you shard for throughput, not size.

Short-code generation

Two mainstream approaches:

Counter + base62

Maintain a global monotonic counter. Encode it in base62 (0-9 a-z A-Z = 62 chars). The N-th URL gets short code base62(N).

counter   short_code
1         "1"
61        "Z"
62        "10"
3,521     "rD"
3.5e9     "5dP21"   (5 chars)
2.2e11    "5xP21q"  (6 chars)

Pros: dense keyspace, no collisions, predictable size growth. 6 chars = 62^6 ≈ 56 billion (more than enough). 7 chars = 62^7 ≈ 3.5 trillion (galactic overkill).

Cons: codes are guessable — aB3xY9 is followed by aB3xZ0. If short-codes leak business secrets ("the 1000th coupon code") that's a problem. Solve with a non-monotonic permutation (Feistel cipher on the counter, base62-encoded).

The trick: don't INCREMENT a global counter per write — that doesn't scale. Hand each ID-generator instance a range of IDs (e.g. 1M at a time) and let it mint codes locally. Snowflake IDs, ZooKeeper id-allocator, AWS Snowflake-style services all do this.

Hash truncation

Hash the long URL (SHA-256), take the first 6 base62 chars.

Pros: same long URL always produces the same short code (deduplication is free).

Cons: collisions exist. With 6 chars (56B keyspace) the birthday-paradox break-even is ~7.5M URLs. After that, every new URL has a non-trivial chance of colliding; you must check the DB on every write.

Most production systems pick counter+base62 with a per-instance range. Hash truncation is a fallback for "if this is shorter than the existing entry."

The write path

client ─▶ LB ─▶ rate limiter ─▶ idempotency ─▶ web ─▶ ID allocator ─▶ DB write
                                                        │
                                                        └─▶ cache populate (optional)

Rate limit per user — 10 writes/minute on the free tier, more for paid. Token bucket per API key.
Idempotency key — client supplies a UUID; if they retry, they get the same short_code back. Critical for the "share a URL" case where a network blip would otherwise create two short codes for the same URL.
Optional custom slug — /shorten { url, slug: "mytalk" }. Validate uniqueness with a transactional INSERT ... ON CONFLICT DO NOTHING. Reject 409 on collision.
DB write — single row insert. Sharded by short_code first character (or first 2) so reads can route directly.

Write throughput is ~200/s at peak. A single Postgres node handles this in its sleep. The write path is the EASY half of this problem.

The read path (the 99%)

client ─▶ CDN edge ─▶ LB ─▶ web ─▶ cache ──hit──▶ 301 redirect
                                       │
                                       └─miss─▶ DB ─▶ populate cache ─▶ 301 redirect
                                                       │
                                                       └─▶ event stream (analytics)

Every layer matters:

CDN edge

For URLs that get hammered (a viral tweet's link, a product launch), the CDN can serve the 301 directly without ever reaching your origin. Cloudflare Workers KV is purpose-built for this — replicate the short_code → long_url map to every edge POP, redirect at the edge in under 10ms.

Load balancer

Standard L7 with health checks. Round-robin or least-connections; the requests are uniform.

Cache (Redis)

The single most important component for read latency. Read-through pattern:

GET url:aB3xY9 from Redis (sub-millisecond).
Cache hit (99%+) → return.
Cache miss → DB lookup → SETEX url:aB3xY9 86400 https://... → return.

Hit rate matters enormously. With a 99% hit rate and the math 0.99 × 0.5ms + 0.01 × 50ms = 0.995ms, average latency is ~1ms. Drop the hit rate to 90% and you're at ~5.5ms — 5× worse.

Why such a high hit rate is achievable: URL access follows a Zipfian distribution. The top 1% of URLs account for ~50% of clicks. They live in cache permanently.

Database

Sharded KV (DynamoDB, Cassandra, ScyllaDB) or sharded SQL by short_code prefix. The lookup is SELECT long_url FROM urls WHERE short_code = ? — a single index hit. With 8-shard cluster and 100M rows, each shard holds 12.5M rows — utterly fine.

For the rare scan use case (analytics), have a separate replica or warehouse — never scan the hot read store.

Async analytics

Click events go to a stream (Kafka, Kinesis, GCP Pub/Sub) after the redirect is sent. The web server fires a non-blocking write to the stream and returns the 301. The analytics service consumes the stream at its own pace — if it falls behind, redirects stay fast.

The crucial discipline: never block the redirect on analytics. Synchronous click logging is the #1 way to tank p99 on a URL shortener.

Putting it together

The full architecture, in one diagram:

                  ┌────── CDN edge KV (top 1% of URLs) ──────┐
                  │                                           │
client ─▶ DNS ────┼─▶ L7 LB ─▶ web (stateless) ─▶ Redis ─miss─▶ Sharded KV ─▶ 301
                  │                  │                              │
                  │                  ├─async─▶ Kafka ─▶ analytics ─▶ ClickHouse
                  │                  │
                  └─writes─▶ rate limit ─▶ idempotency ─▶ ID alloc ─▶ Sharded KV

Every layer has a job:

Layer	Why
CDN edge KV	Serve the hottest URLs from microseconds-from-the-user.
L7 LB	Distribute load, handle TLS, route by Host.
Stateless web	Horizontal scale, easy deploy, easy roll.
Redis cache	99%+ hit rate; absorbs most of the read QPS.
Sharded KV / SQL	Source of truth, sized for write throughput + storage.
ID allocator	Mints short codes without per-write coordination.
Kafka	Decouples redirect path from analytics.
Rate limiter	Per-user quota; protects the write path from abuse.
Idempotency layer	Same client + same key = same result, even on retry.

Trade-offs the interviewer will probe

"What if you used SQL instead of a KV store?" Fine for 100M rows and ~200 writes/s. Sharding becomes manual; reads cost a network hop more than DynamoDB. Most teams pick managed KV for the operational simplicity.

"What if you skipped the cache?" Read latency goes from ~1ms to ~50ms. Database read load goes 100×. Cache is non-negotiable for this workload.

"What if analytics were synchronous?" p99 latency includes the analytics write. If Kafka is slow, redirects are slow. Decoupling is the whole point.

"How do you handle a viral URL?" Top URLs go to the CDN. Cache TTL is short (5 min) but most viral URLs are repeatedly refreshed within that window. Edge KV serves them from POPs.

"What about deletion / expiration?" Soft-delete column + a TTL on the cache key. The redirect handler checks the deleted flag before returning 301; deleted URLs return 410 Gone.

"How do you prevent abuse (URL shorteners are link-of-choice for phishing)?" Per-user write rate limit + a malicious-URL classifier that checks newly created URLs against threat-intel feeds (Google Safe Browsing, PhishTank). Block before serving.

What can go wrong

The stampede on a viral URL. A trending URL expires from cache; 10K simultaneous misses all hit the DB. Solve with single-flight (one DB call shared by all concurrent misses) or longer TTL on hot keys.

Counter contention. A naive global counter INCR url_counter is a single-row hot-spot. Always use range allocation.

Sharding by long_url. Don't. The lookup is by short_code; sharding by long_url means every read is a scatter-gather.

Skipping the idempotency key on writes. A client retry creates a duplicate short code for the same URL. Trivial bug, embarrassing in demo.

Counting clicks in the redirect handler synchronously. Tanks p99 the moment your analytics DB has any hiccup. Stream it.

No malicious-URL filter. URL shorteners get abused for phishing within hours of launch. Not having a filter is irresponsible.

Why this is the canonical interview question

A URL shortener forces you to demonstrate:

Capacity reasoning — can you derive QPS from product numbers?
Multi-layer caching — do you understand why cache at every layer, with what TTL?
Sharding — can you choose a shard key for the dominant query pattern?
Idempotency — do you make writes safe to retry?
Async fanout — do you keep the hot path off the slow path?
Trade-off articulation — for every component, can you justify it with a number?

The "right" answer isn't the fanciest architecture — it's the one where every box is the cheapest possible solution that meets the math you computed two minutes ago. Master this and you can architect any read-heavy public service.

Common tools in production

AWS DynamoDB / Google Bigtable / Cassandra — sharded KV stores at the source-of-truth layer.
Redis / ElastiCache — hot-path cache with sub-ms reads.
Cloudflare Workers + KV / Vercel Edge Config — globally replicated edge KV for the top URLs.
Apache Kafka / AWS Kinesis / GCP Pub/Sub — event stream for click analytics.
ClickHouse / BigQuery / Snowflake — analytics store consuming the click stream.
Bitly's actual stack (per their public talks) — sharded MySQL + Memcached + Redis + Storm/Kafka for analytics.

Diagram conventions

The minimum viable diagram for this problem:

client → LB → web → Redis → DB
                      │       ↓
                      └──────▶Kafka → analytics

In Mermaid:

flowchart LR
    C[client] --> CDN
    CDN --> LB
    LB --> Web[stateless web]
    Web --> R[(Redis)]
    R -.miss.-> DB[(Sharded KV)]
    Web -.async.-> K[(Kafka)]
    K --> A[Analytics]

That diagram, drawn in 90 seconds with capacity numbers labelled on each edge, is the answer to almost any "design a high-traffic read-heavy service" question.

Tools in the wild

4 tools

AWS DynamoDB
Auto-sharded KV store. Single-digit-millisecond reads at any scale. The natural backend for URL shorteners.
service
Redisfree tier
The hot-path cache. 99%+ hit rate keeps DB load bounded and p99 latency tiny.
service
Cloudflare Workers + KV
Edge-deployed compute + replicated KV. Run the redirect at the CDN edge, microseconds from the user.
service
Apache Kafkafree tier
Event stream for click analytics fanout. Decouples write throughput of redirects from analytics ingestion.
service