Discovering and Fixing Bottlenecks
Prompt
Tell me about a performance problem you tracked down and fixed in a system you worked on. Walk me from the first symptom to knowing it was actually solved.
How this round runs
I'll follow your story and drill on the chain: how you found the real bottleneck, how you fixed it, and — the part most people skip — how you know your change is what fixed it and not something else that shifted at the same time. Pick a problem you measured your way through, not one you guessed at.
Model answer
Pick a real performance problem you owned through the whole arc — detect, diagnose, fix, verify — with numbers you can recall. Lead with the symptom and its blast radius: "p99 on the export endpoint walked from five seconds to five minutes as traffic grew", not "it got slow". The strong answer is built on measurement, so narrate the instrument at each step: what metric or profile told you where the time actually went, rather than where you assumed it went. The common miss is fixing the thing you suspected (add a cache) when the real cost was elsewhere (a lock, an N-plus-one, full table scans) — show that you let the profiler, not your prior, pick the target.
The drill that separates this round: "how do you know it was your change?" A deploy rarely lands alone — traffic dips, a cache warms, another team ships, the load test isn't the same shape as production. A strong answer controls for the confound: you held the input fixed (replayed the same query set, the same load profile), changed one thing, compared before-and-after on the same metric, and can say the latency drop tracked the rollout and reverted in the canary that didn't get the change. Then name the next bottleneck — the constraint that becomes tightest once yours is gone — because real systems just move the queue. Keep it first-person and quantified: "I cut p99 from 5 min to 1.2 min, database load down ~80%".
- Led with a concrete symptom and blast radius, with real numbers
- Found the bottleneck by measurement — profile or metric — not by guessing
- Fixed the actual constraint, not the one they first suspected
- Controlled for confounds: held input fixed, changed one thing, compared on the same metric, watched it track the rollout
- Named the next bottleneck and quantified the win
- 'How did you find the bottleneck?' → a profile or metric that pointed at the real cost, not an assumption
- 'How do you KNOW it was your change, not a confound?' → fixed input, one variable changed, before/after on the same metric, tracked the rollout and held in a canary
- 'What's the next bottleneck once this one's gone?' → the constraint that becomes tightest next, because the queue just moves