File Formats

CSV, JSON, YAML, TOML — picking the right one for the job.

Every system you build will read or write structured data. The format you pick matters more than people think — wrong format means fighting your tools forever; right format means the file fades into the background.

Analogy

File formats are like containers in a kitchen. Tupperware (CSV) stacks neatly and fits a fridge full of leftovers — flat rectangles of food, no problem. Jars with screw lids (JSON) hold sauces and pickled things; you can nest small jars in big jars. Mason jars with handwritten labels (YAML) feel friendly but the labels rub off and you end up with "milk" that's actually buttermilk. Bento boxes (TOML) have separate compartments labelled by type — annoying for a casserole but perfect for a packed lunch. The mistake isn't using any of them; the mistake is putting soup in tupperware.

The four you'll meet

Format | Strengths | Weaknesses | Use it for
CSV | Tiny, every tool reads it, scales to GB | Flat only, ambiguous quoting | Tabular data: spreadsheets, bulk imports, analytics
JSON | Universal, strict, machine-readable | No comments, awkward for humans | API payloads, logs, anything between services
YAML | Most human-readable | Implicit type coercion, indentation traps | Config when a tool requires it (k8s, GitHub Actions)
TOML | Human-readable + explicit types | Less nesting power, less popular | Hand-edited config (Cargo.toml, pyproject.toml)

A second tier you'll occasionally meet: XML (slowly dying outside enterprise), INI (still alive in Windows-land), Protocol Buffers / Avro / MessagePack (binary; for performance-sensitive RPC), Parquet / ORC (columnar; for analytics).

CSV

The oldest and simplest. Every row is a record; every column is a field; values are separated by a delimiter (usually ,).

id,name,email,signup_date
1,Alice,alice@example.com,2026-01-15
2,Bob,bob@example.com,2026-02-03

Strengths. Tiny on disk. Streamable — you can read row-by-row without loading the whole file. Every analytics tool, spreadsheet, and database can ingest it. For tabular data, nothing beats it.

Weaknesses.

  • The "is comma the separator" problem. Many European locales use ; because , is the decimal separator there. Tab-separated (TSV) is also common. Always specify the delimiter explicitly when reading CSV; never trust the default.
  • Quoting and embedded newlines. A field containing a comma must be quoted: "Smith, John". A field containing a quote must double the quote: "He said ""hi""". A field containing a newline must be quoted, and now your "one row per line" assumption is wrong.
  • No schema. Column names live in the first row by convention. Types are nowhere — every value is a string until your code parses it.
  • No nesting. A column can be a comma-separated list, but at that point you're putting CSV inside CSV.
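
To make the delimiter and no-schema points concrete, here is a minimal sketch using Python's standard csv module. The semicolon-delimited sample and its height_m column are made up for illustration; they are not from a real export.

```python
import csv
import io
from datetime import date

# Hypothetical semicolon-delimited export with a decimal comma in one column.
raw = (
    "id;name;height_m;signup_date\n"
    "1;Alice;1,72;2026-01-15\n"
    "2;Bob;1,80;2026-02-03\n"
)

# Reading with the default delimiter (,) collapses the header into one column.
wrong_header = next(csv.reader(io.StringIO(raw)))

# Stating the delimiter explicitly gives the four expected columns.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Every value arrives as a string; converting types is the caller's job.
typed = [
    {
        "id": int(r["id"]),
        "name": r["name"],
        "height_m": float(r["height_m"].replace(",", ".")),
        "signup_date": date.fromisoformat(r["signup_date"]),
    }
    for r in rows
]

print(wrong_header)
print(typed[0])
```

The conversion step is where a schema would normally live; CSV itself carries none of it.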

Don't write CSV by string concatenation. Use a library (csv-parse in Node.js, the csv module in Python, encoding/csv in Go); they handle quoting, escaping, and the encoding edge cases.
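
As a rough illustration of what the library does for you, the sketch below writes three awkward fields with Python's csv.writer and reads them back. The sample rows are invented; lineterminator="\n" is set only to keep the output easy to inspect.

```python
import csv
import io

rows = [
    ["id", "name", "note"],
    [1, "Smith, John", 'He said "hi"'],
    [2, "Bob", "line one\nline two"],
]

# The writer decides which fields need quotes and doubles embedded quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
text = buf.getvalue()

# A line-based split sees four lines, because one field contains a newline.
naive_line_count = len(text.splitlines())

# A real CSV parser sees three records and restores the original field values.
parsed = list(csv.reader(io.StringIO(text)))

print(text)
print(naive_line_count, len(parsed))
```

Note that the ids come back as the strings "1" and "2": the round trip preserves text, not types.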

JSON

The lingua franca of the web. Two data types matter:

  • Object: {"key": "value"}. Keys must be strings.
  • Array: [1, 2, 3]. Order is preserved.

Plus four primitive types: string (double-quoted), number (no NaN or Infinity allowed), true/false, null.

Strengths. Strict: only one way to write things. Universal: every language has a parser and serializer in its standard library or a first-tier library. Easy to validate against a schema (JSON Schema, Zod). Compact when minified.

Weaknesses.

  • No comments. This is the most-complained-about absence. The pragmatic workaround: a _comment field. The technical workaround: JSONC or JSON5 (extensions that allow comments) — but they're not the JSON your JSON.parse accepts.
  • Trailing commas are a syntax error. {"a": 1,} won't parse. JavaScript fixed this in source code; JSON did not.
  • Numbers have no specified precision. JSON doesn't say what numeric type to use, so different parsers do different things with very large or very precise numbers, and many map every number to a binary float (which is why 1.1 + 2.2 === 3.3000000000000003 in JS). For monetary values, use strings.
  • Verbose for human editing of large files. Quote-comma-quote-comma. YAML and TOML solve this.
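
The strictness and number pitfalls above can be checked with Python's standard json module. This is a sketch under assumptions: the 19-digit ID and the amount are invented values, and the float conversion stands in for what a parser with only 64-bit floats (JavaScript's JSON.parse, for example) does to a large integer.

```python
import json
from decimal import Decimal

# 1. A strict parser rejects a trailing comma.
try:
    json.loads('{"a": 1,}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False

# 2. NaN is not valid JSON; allow_nan=False makes the serializer enforce that.
try:
    json.dumps({"ratio": float("nan")}, allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True

# 3. A 19-digit ID does not survive a trip through a 64-bit float.
big_id = 1234567890123456789
lost_precision = int(float(big_id)) != big_id

# 4. Sending IDs and money as strings sidesteps the numeric-type question.
payload = json.dumps({"id": str(big_id), "amount": "19.99"})
decoded = json.loads(payload)
amount = Decimal(decoded["amount"])

print(trailing_comma_ok, nan_rejected, lost_precision, amount)
```

Python's own parser keeps large integers exact, so the precision loss only shows up once the value is forced into a float; other languages may lose it at parse time.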

The right tool when machines write to other machines.

YAML

Designed to be human-readable. Significant indentation (like Python). Uses key: value instead of "key": "value".

name: app
retries: 3
flags:
  debug: true
servers:
  - alpha
  - beta

Strengths. Compact. Comments allowed. Good for hand-editing.

Weaknesses — and they are real.

  • The quoting tarpit. version: 3.10 parses as the float 3.1 — the trailing zero is gone. country: NO parses as the boolean false (an old YAML 1.1 thing — the "Norway problem"). time: 12:30 may parse as the integer 750 in some implementations. The fix is to quote anything that looks like a number, boolean, or date if you actually mean a string. The bigger fix is to use TOML.
  • Indentation matters. Tabs are not allowed for indentation, and two-versus-four-space drift is easy to introduce; every YAML hand-edit is a chance to break the file.
  • Anchors and references. YAML has &anchor and *reference for re-using values. Powerful but rarely understood. Note that a "safe load" mode (PyYAML's safe_load, for example) blocks arbitrary object construction; it does not switch anchors off.
  • Multiple spec versions. YAML 1.0, 1.1, 1.2 differ in subtle ways. Many parsers still default to 1.1 quirks for backwards compatibility.
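
To show why the quoting tarpit happens, here is a deliberately simplified model of YAML 1.1 implicit typing in plain Python. It is not a YAML parser: resolve_yaml11_scalar is a hypothetical helper that covers only the boolean, integer, sexagesimal and float rules mentioned above, and real parsers differ in the details.

```python
import re

# Words that YAML 1.1 style resolvers commonly treat as booleans (simplified).
_BOOL_WORDS = {"yes": True, "no": False, "on": True, "off": False,
               "true": True, "false": False}
YAML11_BOOLS = {}
for word, value in _BOOL_WORDS.items():
    for variant in (word, word.capitalize(), word.upper()):
        YAML11_BOOLS[variant] = value


def resolve_yaml11_scalar(text: str):
    """Guess the type of an unquoted scalar the way a YAML 1.1 resolver might."""
    if text in YAML11_BOOLS:
        return YAML11_BOOLS[text]
    if re.fullmatch(r"[-+]?\d+", text):
        return int(text)
    # Sexagesimal (base-60) integers: 12:30 means 12 * 60 + 30.
    if re.fullmatch(r"[-+]?[1-9]\d*(:[0-5]?\d)+", text):
        total = 0
        for part in text.lstrip("+-").split(":"):
            total = total * 60 + int(part)
        return -total if text.startswith("-") else total
    if re.fullmatch(r"[-+]?\d*\.\d+", text):
        return float(text)
    return text  # anything else stays a string


for scalar in ["3.10", "NO", "12:30", "alpha"]:
    print(scalar, "->", repr(resolve_yaml11_scalar(scalar)))
```

Quoting a value means the resolver never gets to guess, which is why the advice to quote anything ambiguous works.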

YAML is the right format when a tool you use requires it — Kubernetes, GitHub Actions, Docker Compose, Ansible. Don't pick YAML for a new file format you control.

TOML

Tom's Obvious, Minimal Language. Hand-editable config without YAML's footguns.

name = "app"
retries = 3

[flags]
debug = true

[[servers]]
name = "alpha"
[[servers]]
name = "beta"

Strengths.

  • Explicit types. version = "3.10" is a string; version = 3.10 is a float. No coercion, no surprises.
  • No indentation rules. Tables are explicit ([section]).
  • Comments allowed: # to end of line.
  • Strong for sectioned config: Cargo, pyproject, Gleam, and Hugo all use TOML.

Weaknesses.

  • Awkward for deep nesting. Arrays of tables ([[servers]]) work but get verbose past two levels.
  • Less ubiquitous than YAML/JSON — pick a library; not every language has a stdlib parser yet (Python 3.11+ does).

For "config file the developer hand-edits", TOML is the modern best-practice choice.

Picking the format

A short decision tree:

  1. Is it tabular data? → CSV (or Parquet for analytics).
  2. Is it a payload between machines? → JSON.
  3. Is it config that humans will hand-edit? → TOML.
  4. Does the tool I'm using force YAML? → YAML.
  5. Is performance critical and the schema fixed? → Protobuf, Avro, MessagePack.

Notice what's NOT on this tree: "general-purpose data interchange that humans will edit AND machines will consume". That's the YAML zone, and YAML's quoting tarpit means you'll hate it eventually. If you have a free choice, prefer JSON for the machine side and a separate human-friendly view (markdown, TOML, a UI) for the human side.

Common bugs and how to spot them

  • CSV with one column. You guessed , and the file uses ;. Specify the delimiter.
  • Numbers that aren't quite right in JSON. Floating-point precision, or your parser silently coerced a 19-digit ID to a 64-bit float, which represents integers exactly only up to 2^53. Use strings for IDs and money.
  • YAML config where a value is the wrong type. Implicit type coercion. Quote it.
  • TOML file the parser refuses. Almost always a missing [section] header or a duplicate key — TOML is strict about both.
  • CSV with embedded newlines breaking your line-by-line reader. Use a real CSV parser, not split("\n").

Tooling — knowing the unix verbs

You'll work with these formats from the shell. The verbs:

  • jq — query and transform JSON. jq '.users[] | select(.active)' < users.json.
  • yq — same but for YAML/TOML/XML. yq '.spec.replicas' < deployment.yaml.
  • miller (mlr): awk for tabular data. mlr --csv filter '$status == "active"'.
  • csvkit — CSV-specific suite. csvsql --query "SELECT count(*) FROM stdin" < data.csv.

These are among the highest-leverage tools you can have on your $PATH for working with data files.

What to internalise

  • File formats are tools — pick the right one and the file fades away.
  • CSV: tabular, flat, always specify the delimiter.
  • JSON: inter-service, strict, no comments.
  • YAML: only when forced — quote anything ambiguous.
  • TOML: hand-edited config, explicit types, your friend.

Tools in the wild

5 tools

  • jq (free tier, cli). The unix tool for slicing and querying JSON.
  • yq (free tier, cli). jq-style queries for YAML/TOML/XML.
  • miller (free tier, cli). Like awk for CSV/TSV/JSON-lines tabular data.
  • csvkit (free tier, cli). Suite for slicing, joining, and SQL-querying CSV files.
  • Schema-aware YAML editing (service). Catches type-coercion bugs at edit time.