Source to Running

Lexer, parser, AST, bytecode, IR, machine code, loader, runtime.

A text file becomes a running process through a pipeline of narrow, well-defined stages. Different languages skip or combine stages, but the names and ordering are stable. Understanding the pipeline is what lets you read an error message and know which tool emitted it.

Analogy

Think of mailing a handwritten letter overseas. The envelope passes through a tight sequence of stations: the local postbox, the sorting office that reads the postcode, the regional hub that bundles all mail for that country, customs, the foreign national carrier, the destination sorting office, the postie on a bike. Each station does one narrow job and complains with its own characteristic sticker when something is wrong — "no postage", "illegible address", "blocked at customs". When a letter never arrives, the return sticker tells you exactly which station rejected it. Compilers are the same: each stage has a signature error voice.

1. Lexer (tokenizer)

Characters in, tokens out. A stream like let x = 42; becomes [LET, IDENT("x"), EQUALS, NUMBER(42), SEMI]. A lexer is essentially a finite-state machine, so its failures are local: an illegal character, an unterminated string or comment, a malformed numeric literal. Compilers use generators such as flex or ANTLR, or hand-written lexers (V8's scanner is hand-written C++).
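A minimal sketch of a lexer for the let x = 42; example, written in Python with one regular expression per token kind. The token names follow the text; the Token type, TOKEN_SPEC table, and tokenize function are hypothetical names chosen for illustration, not taken from any real compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token set for the toy `let` language used in the text.
# Order matters: LET is listed before IDENT so the keyword wins.
TOKEN_SPEC = [
    ("LET", r"let\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("EQUALS", r"="),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # The lexer's characteristic failure: a character no rule accepts.
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())
        pos = match.end()


tokens = list(tokenize("let x = 42;"))
print([t.kind for t in tokens])  # ['LET', 'IDENT', 'EQUALS', 'NUMBER', 'SEMI']
```

Note that the lexer knows nothing about grammar: it would happily tokenize = = let 42, leaving the complaint to the parser.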

2. Parser

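Tokens in, parse tree out. The parser follows the language grammar and assembles tokens into nested productions. Many production compilers use hand-written recursive descent, with operator-precedence or Pratt-style parsing for expressions (Clang, V8, and rustc all have hand-written parsers). A syntax error is the parser's signature failure: the tokens are individually valid, but their order matches no grammar rule.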

parser sees: IDENT = NUMBER
recognizes : assignment → identifier "=" expression
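The assignment rule above can be sketched as a small recursive-descent parser with precedence climbing for the expression part. This is an illustrative toy, not how any particular compiler is written: the Parser class, the BINDING_POWER table, and the OP token kind are assumptions added for the example, and the token list is written by hand to stand in for lexer output.

```python
# Tokens are (kind, text) pairs; this hand-written list stands in for lexer output.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind):
        actual_kind, text = self.peek()
        if actual_kind != kind:
            # The parser's characteristic failure: legal tokens in an illegal order.
            raise SyntaxError(f"expected {kind}, got {actual_kind} {text!r}")
        self.pos += 1
        return text

    def assignment(self):
        # assignment -> IDENT "=" expression
        target = self.expect("IDENT")
        self.expect("EQUALS")
        return ("assign", target, self.expression(0))

    def expression(self, min_power):
        # Precedence climbing: parse one operand, then absorb operators
        # whose binding power is at least min_power.
        kind, _ = self.peek()
        if kind == "NUMBER":
            left = ("num", int(self.expect("NUMBER")))
        else:
            left = ("name", self.expect("IDENT"))
        while True:
            kind, op = self.peek()
            if kind != "OP" or BINDING_POWER[op] < min_power:
                return left
            self.pos += 1
            # The +1 makes binary operators left-associative.
            right = self.expression(BINDING_POWER[op] + 1)
            left = (op, left, right)


# x = 1 + 2 * 3
tokens = [("IDENT", "x"), ("EQUALS", "="), ("NUMBER", "1"),
          ("OP", "+"), ("NUMBER", "2"), ("OP", "*"), ("NUMBER", "3")]
tree = Parser(tokens).assignment()
print(tree)
```

The multiplication ends up nested under the addition because * has the higher binding power, which is the whole point of the expression loop.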

3. AST (abstract syntax tree)

The parse tree, stripped of punctuation and grouping tokens. A node for the assignment, with children for its target and value. Front-end semantic passes (name resolution, type checking, desugaring) typically walk the AST, and codegen or lowering to an IR starts from it. You can dump Python's AST with ast.dump(ast.parse(src)) and explore TypeScript's at ts-ast-viewer.com.
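As a small illustration using Python's standard ast module, the snippet below parses a one-line assignment and inspects the resulting node. The exact text printed by ast.dump differs between Python versions, so no fixed dump is shown; the indent argument assumes Python 3.9 or newer.

```python
import ast

source = "x = 42"
tree = ast.parse(source)

# ast.dump renders the tree as text; the exact layout varies by Python version.
print(ast.dump(tree, indent=2))

assign = tree.body[0]
# The "=" sign itself is gone: only the structure (targets, value) remains.
print(type(assign).__name__, assign.targets[0].id, assign.value.value)  # Assign x 42
```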

4. IR (intermediate representation)

An AST is tree-shaped and close to source; machine code is flat and linear. Most serious compilers lower through one or more IRs in between. LLVM IR is the most famous — a typed, SSA-form (static single assignment), RISC-like virtual instruction set.

define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}

Optimization passes (dead-code elimination, inlining, loop-invariant code motion) operate at the IR level because the representation is uniform and easy to transform.
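To make "easy to transform" concrete, here is a hedged sketch of dead-code elimination over a toy straight-line IR in Python. The Instr type and the eliminate_dead_code function are hypothetical; the IR only borrows LLVM's %name convention and is far simpler than real LLVM IR (no branches, no side effects other than ret).

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    dest: Optional[str]   # SSA name defined by this instruction, or None
    op: str               # "add", "mul", "ret", ...
    args: tuple           # SSA names (str) or integer constants


def eliminate_dead_code(instrs):
    """Drop instructions whose result is never used, in one backwards walk."""
    live = set()
    kept = []
    for instr in reversed(instrs):
        has_effect = instr.op == "ret"   # only `ret` is observable in this toy IR
        if has_effect or instr.dest in live:
            kept.append(instr)
            live.update(a for a in instr.args if isinstance(a, str))
    kept.reverse()
    return kept


program = [
    Instr("%sum", "add", ("%a", "%b")),
    Instr("%unused", "mul", ("%a", 2)),   # result is never read
    Instr(None, "ret", ("%sum",)),
]
optimized = eliminate_dead_code(program)
for instr in optimized:
    print(instr)
```

Because every value is assigned exactly once and the code is a flat list, a single backwards pass is enough here; doing the same on a tree-shaped AST would be considerably more awkward.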

5. Bytecode

Bytecode is a platform-neutral instruction format, designed to be easy to interpret or JIT-compile. A .pyc file holds cached CPython bytecode; .class is JVM bytecode; .wasm is WebAssembly. A bytecode instruction set is typically stack-based (JVM, CPython, WebAssembly) or register-based (Lua, Dalvik, V8 Ignition).

Bytecode is the natural artifact boundary for languages with a VM: ship bytecode to the user, the VM handles translation to native.
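CPython makes its bytecode easy to inspect with the standard dis module. The sketch below disassembles a two-argument function; opcode names and counts change between CPython releases, so the listing is printed rather than hard-coded.

```python
import dis


def add(a, b):
    return a + b


# CPython has already compiled `add` to bytecode; the raw bytes live
# on the function's code object.
raw = add.__code__.co_code
print(len(raw), "bytes of bytecode")

# dis decodes those bytes into named instructions. Names differ between
# CPython versions, so no fixed listing is shown here.
opnames = [instr.opname for instr in dis.get_instructions(add)]
print(opnames)
```

On any recent CPython the list includes instructions that load the two arguments, a binary-operation instruction, and a return instruction, which is the stack-machine shape described above.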

6. Machine code

Real CPU instructions: x86-64, ARM64, RISC-V. They are produced either ahead of time (gcc, rustc, go), ending up in an ELF, Mach-O, or PE binary, or just in time (V8, HotSpot, .NET CLR), written into executable memory at runtime. A .o file contains relocatable machine code; the linker resolves symbols and produces an executable.

7. Loader

Mostly the OS kernel's job. On execve, the kernel reads the ELF or Mach-O header, maps the text and data segments into the process's virtual address space, and sets up the stack. For a dynamically linked program, the dynamic linker (ld-linux.so on Linux, dyld on macOS) then loads shared libraries and applies relocations; PLT entries are usually bound lazily, on first call. Only after all this does the program's entry point run.
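The first thing a loader reads is the file header. As a rough sketch, the Python below decodes the fixed 64-byte ELF64 header with the struct module. The parse_elf64_header function and the returned dictionary keys are hypothetical names; the example feeds it a synthetic header built in memory so it runs on any OS, and it handles only the 64-bit little-endian case.

```python
import struct

# ELF64 header layout (little-endian): e_ident[16], e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx. 64 bytes in total.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def parse_elf64_header(data: bytes) -> dict:
    fields = ELF64_HEADER.unpack_from(data)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: bad magic")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    return {
        "type": fields[1],      # 2 = ET_EXEC, 3 = ET_DYN
        "machine": fields[2],   # 62 = x86-64, 183 = AArch64
        "entry": fields[4],     # virtual address where execution starts
        "phoff": fields[5],     # file offset of the program header table
        "phnum": fields[10],    # number of program headers (segments to map)
    }


# A synthetic header so the example runs anywhere; the values are made up.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
fake = ELF64_HEADER.pack(ident, 3, 62, 1, 0x1040, 64, 0, 0, 64, 56, 13, 64, 0, 0)
header = parse_elf64_header(fake)
print(hex(header["entry"]), header["phnum"])  # 0x1040 13
```

On a Linux machine you could pass the first 64 bytes of a real executable instead; the program headers that phoff and phnum point to are what tell the kernel which segments to map.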

8. Runtime

The process is executing. The language runtime — garbage collector, allocator, scheduler, exception handler — runs inside the process and provides services the generated code calls into. This is where stack traces, GC pauses, and threading live.
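One runtime service can be observed directly from Python. The sketch below assumes CPython, where reference counting frees most objects immediately and a separate cycle collector (exposed through the gc module) reclaims reference cycles. The Node class and variable names are illustrative only.

```python
import gc
import weakref


class Node:
    pass


gc.disable()                  # keep the automatic collector out of the way
a, b = Node(), Node()
a.other, b.other = b, a       # reference cycle: refcounts never reach zero
probe = weakref.ref(a)        # observes the object without keeping it alive

del a, b
alive_before = probe() is not None   # unreachable, but still allocated

found = gc.collect()          # ask the runtime's cycle collector to run now
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after, found)
```

Nothing in the user code frees the two nodes; the runtime does it, which is exactly the kind of service this stage provides.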

Different languages, different subsets

Language          Lexer  Parser  AST  Bytecode        IR                Machine code  Loader             Runtime
C                 yes    yes     yes  -               often             AOT           yes                minimal (libc)
Rust              yes    yes     yes  -               HIR/MIR, LLVM IR  AOT           yes                minimal
Go                yes    yes     yes  -               SSA               AOT           yes                GC + scheduler
Java              yes    yes     yes  yes (.class)    yes (HotSpot IR)  JIT           yes (classloader)  JVM
Python (CPython)  yes    yes     yes  yes (.pyc)      -                 -             -                  interpreter
JavaScript (V8)   yes    yes     yes  yes (Ignition)  yes (TurboFan)    JIT           -                  V8

C skips bytecode entirely. CPython traditionally skips native codegen (an experimental JIT first shipped in 3.13). Java and V8 do everything, which is why they're so complex and so interesting.

Where errors come from

When you see an error, ask which stage emitted it. That tells you where to look for the fix.

  • SyntaxError → parser. Almost always a typo.
  • TypeError: cannot assign ... → semantic pass on the AST (static lang) or runtime dispatch (dynamic lang).
  • undefined reference to 'foo' → linker. A symbol wasn't found.
  • error while loading shared libraries → dynamic loader. A .so is missing.
  • Segmentation fault → runtime. Usually a bad pointer deref.
  • GC overhead limit exceeded → runtime. The collector can't keep up.
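The same question can be asked programmatically. In this Python sketch, compile() runs CPython's tokenizer, parser, and bytecode compiler without executing anything, so an error raised there is a compile-time error, while anything raised by exec() comes from the runtime. The stage_of_failure helper is a hypothetical name used only for this example.

```python
def stage_of_failure(source: str) -> str:
    """Report which stage rejects `source`: compile time or run time."""
    try:
        # compile() tokenizes, parses, and generates bytecode; nothing runs yet.
        code = compile(source, "<example>", "exec")
    except SyntaxError as err:
        return f"compile time: {err.msg}"
    try:
        exec(code, {})
    except Exception as err:
        return f"runtime: {type(err).__name__}"
    return "ok"


print(stage_of_failure("x = = 1"))       # rejected before any code runs
print(stage_of_failure("x = 1 + 'a'"))   # compiles fine, fails when executed
print(stage_of_failure("x = 1 + 2"))     # ok
```

The second case is the instructive one: the source is syntactically valid, so every front-end stage accepts it, and the TypeError only appears once the bytecode is actually executed.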

Stages aren't abstractions compilers hide from you; they're exactly what the toolchain is doing. Print the AST, dump the IR, objdump the binary, strace the loader. You can watch every stage happen.