Design a Paste-Bin Service
Prompt
Our support engineers keep dumping logs and stack traces into a shared chat to share them, which is a mess. They want an internal "paste" tool: drop in some text, get a link, send the link to a colleague. Some pastes are huge crash dumps, some should expire, and a few are sensitive. Build it.
How this round runs
The brief mentions "huge," "expire," and "sensitive" in passing but doesn't pin any of them down. You drive: turn those hints into real requirements, then design it, and I'll push you to go deep on where the bytes actually live and how expiry really works, and to commit to a trade-off.
Model answer
1. Requirements I'd surface first.
- How big is "huge"? A 200KB stack trace and a 2GB heap dump want completely different storage. I'd ask for a size distribution and a hard cap. This decides DB-vs-blob (see deep-dive).
- Expiry semantics: user-chosen TTL? burn-after-read? a default max retention? "Some should expire" needs to become "expiry is optional, defaults to 30 days, max 1 year, and burn-after-read is a checkbox."
- Access control: "a few are sensitive" → is a paste public-by-link, private to the author, or shared with named colleagues? For an internal tool I'd default to "anyone in the company with the link," with a private option.
- Edit/immutable? Pastes are usually write-once; if they're editable I have to think about consistency on overwrite — I'd confirm they're immutable, which simplifies everything.
- Read:write ratio + volume: support tooling is read-light per paste (a couple of colleagues view it) but write-steady. Say 50k pastes/day, kept ~30 days.
Non-functional: create and view both fast; sensitive pastes must not be enumerable; expired or burned pastes must be unreadable immediately, not "eventually."
2. High-level design.
client --> API server --> metadata store (small rows: id, owner, expiry, acl, blob-ref)
|
+--> blob store (the actual paste bytes) [for large pastes]
|
short-id generator (base62) expiry sweeper (background job)
Key move: split metadata from the bytes. The metadata store holds small,
queryable rows (short id, owner, visibility, expires_at, a pointer to the bytes);
the bytes live wherever they fit best. A short-id generator mints the link
(base62, ~7 chars). A background sweeper handles expiry.
3. Deep-dive: blob storage vs DB, and expiry sweeping.
Where the bytes live. Small pastes (a few KB of log lines) are fine inline in the metadata row — one round trip, no second system. But "huge crash dumps" in a relational row bloat the table, blow up backups, and hurt every query that touches that row. So I'd commit to a size threshold: under ~64KB, store inline; over it, write the bytes to a blob/object store and keep only a reference in the row. Object storage is built for large immutable objects, scales independently, and is cheap. The cost is a second system and a two-step read (row, then blob), but it keeps the metadata store small and fast — which every other query depends on.
Expiry, done right. The naive "a cron job deletes expired rows" has a gap: a
paste that expired five minutes ago is still readable until the sweep runs. For the
sensitive/burn-after-read case that's a correctness bug, not a cleanup nicety. So
expiry is two layers: (a) check expires_at on every read and return 404
the instant a paste is past expiry, regardless of whether the sweeper has run — and
evict it from cache so a stale cached copy can't serve it; (b) a background
sweeper that actually reclaims storage (deletes the row and the blob) so the
system doesn't accumulate dead data. Burn-after-read deletes (or tombstones) within
the same transaction as the first successful read. The read-time check is for
correctness; the sweeper is for space.
4. A committed trade-off and its cost. I'd commit to the 64KB inline / blob-store-above threshold rather than "everything in the DB" or "everything in blob." The cost I name out loud: most pastes (small ones) get the simple one-system, one-round-trip path, but I've added a branch in the code and an eventual-cleanup window where a blob can briefly outlive its metadata row if a delete half-fails — so I make blob deletion idempotent and let the sweeper re-reap orphans. I accept a little operational reconciliation in exchange for never bloating the metadata store with multi-megabyte rows. If the team later says "pastes are always small," I'd collapse back to inline-only and delete the blob path.
5. Operational concerns. Failure mode: the sweeper crashes or falls behind,
and expired/blob data piles up — storage grows unbounded and, worse, a stale cache
might still serve content the read-time check should have blocked. I'd detect it
with a metric on "rows past expires_at still present" and alert when that backlog
grows. Because the read-time check is the real guard, a lagging sweeper is a
space/cost problem, not a data-exposure one — that's exactly why I made the read
path the source of truth. Rollback: the service is stateless; bad deploys roll back
by image, and the immutable-paste model means there's no in-place data to migrate.
- Turned the vague 'huge / expire / sensitive' hints into concrete size cap, TTL semantics, and ACL requirements
- Split metadata from bytes and justified a size threshold for inline-vs-blob storage
- Caught that cron-only expiry leaves a readable-after-expiry gap, and made the read path the correctness guard
- Committed to the 64KB threshold and named its cost (a branch plus orphan-blob reconciliation)
- Distinguished read-time expiry (correctness) from the sweeper (space reclamation)
- 'A paste expired one minute ago but the sweep runs hourly — is it readable?' → no, the read-time expires_at check returns 404 immediately and evicts cache
- 'Where do the bytes for a 1GB dump go?' → blob/object store with a reference in the row; inline only below the threshold
- 'What breaks at 10x pastes/day?' → the sweeper backlog and blob-store write throughput; partition the sweep and batch deletes
- 'How do you make a sensitive paste non-enumerable?' → unguessable id (keyed-permutation, not sequential) plus an ACL check, not security-by-obscurity alone