Reading Flame Graphs
Where did all the time actually go?
A flame graph is a profile in picture form. It shows where a program spent its time — not which line of code looks slow, but which actually was, by sample count.
Reading them is a skill. Once you have it, performance work goes from "stare at the code and guess" to "look at the graph and know."
Analogy
Imagine the running program as a stack of nested function calls. Every few milliseconds, a sampler asks "what's on top right now?" and writes it down. After collecting thousands of samples, you've got a frequency distribution: which call stacks were actually executing, weighted by how often they appeared.
A flame graph is that distribution rendered as a tree. The bottom row is the entry point. Each row above is the next level of nesting. The width of each box is how many samples landed in that function or its descendants. Wide boxes are where the time went.
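The aggregation step can be sketched in a few lines of Python. The stacks below are invented illustration data, not a real profile; merging identical stacks gives exactly the frequency distribution the flame graph draws:

```python
from collections import Counter

# Each sample is the call stack at one timer tick, root first.
# These stacks are made-up illustration data, not a real profile.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "query_db"),
    ("main", "render_response"),
]

# Merged count = box width; stack depth = box height.
widths = Counter(samples)
for stack, count in widths.most_common():
    print(";".join(stack), count)
```

The `stack;names count` lines printed here are the "folded stacks" format that FlameGraph-style tools consume; the widest entry is where this toy program spent most of its sampled time.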
Anatomy
- X-axis: sample count (proportional to time spent). Siblings are sorted alphabetically within their parent, so left-to-right order is not chronological.
- Y-axis: call stack depth. Bottom is the entry; top is the deepest nesting at sample time.
- Each box: one function (or method, or kernel call). Width = how often this function was on the stack.
How sampling profilers work
The profiler runs your code with a timer. Every N milliseconds it interrupts and snapshots the current call stack. After M seconds, you have M × (1000/N) snapshots. Identical stacks are merged; the merged count becomes the box width.
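That loop can be sketched directly. A minimal toy sampler, using a background thread in place of a timer interrupt (real profilers such as py-spy use signals or read the process from outside; this only shows the mechanics of snapshot-and-merge):

```python
import sys
import threading
import time
from collections import Counter

def sample_stacks(target_thread_id, counts, interval, stop):
    # Every `interval` seconds, snapshot the target thread's stack
    # and merge identical stacks into a running count.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            stack = []
            while frame:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            counts[tuple(reversed(stack))] += 1  # root first
        time.sleep(interval)

def busy():
    # CPU-bound work for the sampler to catch.
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total

counts = Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.main_thread().ident, counts, 0.005, stop),
)
sampler.start()
busy()
stop.set()
sampler.join()

# Samples that landed while busy() was on the stack should dominate.
hot = sum(n for stack, n in counts.items() if "busy" in stack)
print(hot, "of", sum(counts.values()), "samples in busy()")
```

Note the statistical flavour: the exact counts differ run to run, but the hot function is wide every time.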
This means the data is statistical, not exact. A function with 10ms of cumulative runtime, sampled every 10ms, averages just one sample per run; any given run might record zero, or three. At long enough collection windows the law of large numbers kicks in and the picture stabilises.
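A quick simulation of that stabilisation, assuming a hypothetical function that truly accounts for 5% of on-CPU time:

```python
import random

random.seed(42)
TRUE_SHARE = 0.05  # hypothetical function truly on-CPU 5% of the time

def estimate(n_samples):
    # Each timer tick independently lands in the function with
    # probability equal to its true share of on-CPU time.
    hits = sum(random.random() < TRUE_SHARE for _ in range(n_samples))
    return hits / n_samples

estimates = {n: estimate(n) for n in (100, 10_000, 1_000_000)}
for n, share in estimates.items():
    print(f"{n:>9} samples -> measured share {share:.4f}")
```

With 100 samples the measured width can be badly off; by a million samples it sits within a fraction of a percent of the true share, which is why longer capture windows give trustworthy graphs.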
Chase the wide, ignore the tall
The single most useful rule. Width = time spent. A deep stack of narrow boxes is one function calling another — none of them are bottlenecks. A wide flat box is where the program is actually spending its time.
A common mistake: optimizing the deepest box because "it's at the bottom of a long chain, must matter." If it's narrow, it doesn't matter. The chain of callers above it represents your investigation path, not your optimization target.
On-CPU vs off-CPU
Standard flame graphs show on-CPU time — the program actively running on a CPU. They don't show time spent waiting on I/O, blocked on locks, sleeping, or in OS scheduler queues.
If your program is slow because it waits on disk, on the network, on a database, on a mutex — a CPU flame graph shows nothing. You'll see fast on-CPU work and miss the wait. Off-CPU profiling needs a different tool (eBPF, perf sched, strace, vendor profilers).
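The blind spot is easy to demonstrate: a sleep (standing in for any off-CPU wait on disk, network, or a lock) consumes wall-clock time but almost no CPU time, and CPU time is all an on-CPU profiler measures:

```python
import time

def io_bound():
    time.sleep(0.2)  # stands in for waiting on disk/network/lock

wall_start = time.perf_counter()
cpu_start = time.process_time()
io_bound()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# Nearly all the elapsed time is off-CPU: a CPU flame graph of this
# program would be almost empty even though it took ~0.2s to run.
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

If `wall` is large while `cpu` is near zero, reach for an off-CPU tool before touching the flame graph.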
Common patterns
The plateau — one wide function eating everything. Optimization target: that function. High value.
The lattice — many narrow paths, no single wide bar. Optimization is harder; usually means an architectural problem (too much per-request work spread across many small functions).
The hidden slowdown — the function looks fine but its callee dominates. Click in to see the descendant graph; the optimization target is the wide child.
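The hidden-slowdown pattern falls out of the difference between a function's self time (samples where it was on top of the stack) and its inclusive time (samples anywhere in its subtree). A sketch over hypothetical folded stacks:

```python
from collections import Counter

# Hypothetical folded stacks: tuple = call path, value = sample count.
folded = {
    ("main", "handle"): 2,                # handle's own work: tiny
    ("main", "handle", "serialize"): 48,  # but its callee dominates
    ("main", "other_work"): 10,
}

self_time = Counter()
inclusive = Counter()
for stack, n in folded.items():
    self_time[stack[-1]] += n  # on top of the stack
    for fn in stack:           # anywhere in the stack
        inclusive[fn] += n

print("handle: self", self_time["handle"], "inclusive", inclusive["handle"])
print("serialize: self", self_time["serialize"])
```

By self time, `handle` looks cheap; by inclusive time it owns most of the profile, and clicking into it reveals the wide child, `serialize`, as the real target.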
Differential flame graphs
The most useful tool you've never heard of. Render a flame graph for "before deploy" and another for "after." Subtract the two: red boxes are functions that got worse, green are functions that got better, opacity proportional to delta size.
A single differential flame graph can answer "did the deploy regress anything" in 10 seconds. Without it, you're squinting at two separate graphs.
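The subtraction itself is simple once both profiles are in folded form. A sketch with invented before/after sample counts (the stack names are illustrative, not from a real service):

```python
# Hypothetical before/after folded profiles: stack -> sample count.
before = {"main;handle;parse": 120, "main;handle;db": 300, "main;gc": 40}
after  = {"main;handle;parse": 115, "main;handle;db": 520, "main;gc": 45}

# Per-stack delta: positive means the deploy made this path wider
# (worse), negative means narrower (better).
deltas = {
    stack: after.get(stack, 0) - before.get(stack, 0)
    for stack in set(before) | set(after)
}
for stack, d in sorted(deltas.items(), key=lambda kv: -abs(kv[1])):
    print(f"{d:+5d}  {stack}")
```

Sorting by absolute delta surfaces the regression immediately: the database path grew by hundreds of samples while everything else barely moved.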
Tools
- Brendan Gregg's FlameGraph.pl — the original. Generates flame graphs from `perf` output.
- speedscope.app — browser-based viewer; opens many profile formats.
- pprof — Google's profile format; native in Go, supported widely.
- Chrome DevTools Performance tab — built-in flame graph for browser profiles.
In the playground
The "Profile a Slow Service" exercise gives you a baseline flame graph and 6 candidate optimizations. Pick 3, and see the differential flame graph for each. Then identify which 2 of your 3 picks actually helped: some optimizations are red herrings that look fast in microbenchmarks but don't move production.
Tools in the wild
- Speedscope — browser-based flamegraph viewer; drop a profile in, get sandwich/left-heavy views instantly.
- FlameGraph (Brendan Gregg) — the original Perl scripts that turn folded stacks into SVG flame graphs.
- py-spy — sampling profiler for Python; runs out-of-process, no code changes required.
- async-profiler — low-overhead JVM profiler that produces flame graphs from CPU + alloc samples.
- pprof — Google's profile viewer; Go's built-in profiler emits its format, and it reads `perf.data` too.
- Pyroscope (Grafana) — continuous profiling backend; store, diff, and query flame graphs over time.