File Formats
CSV, JSON, YAML, TOML — picking the right one for the job.
Every system you build will read or write structured data. The format you pick matters more than people think — wrong format means fighting your tools forever; right format means the file fades into the background.
Analogy
File formats are like containers in a kitchen. Tupperware (CSV) stacks neatly and fits a fridge full of leftovers — flat rectangles of food, no problem. Jars with screw lids (JSON) hold sauces and pickled things; you can nest small jars in big jars. Mason jars with handwritten labels (YAML) feel friendly but the labels rub off and you end up with "milk" that's actually buttermilk. Bento boxes (TOML) have separate compartments labelled by type — annoying for a casserole but perfect for a packed lunch. The mistake isn't using any of them; the mistake is putting soup in tupperware.
The four you'll meet
| Format | Strengths | Weaknesses | Use it for |
|---|---|---|---|
| CSV | Tiny, every tool reads it, scales to GB | Flat only, ambiguous quoting | Tabular data: spreadsheets, bulk imports, analytics |
| JSON | Universal, strict, machine-readable | No comments, awkward for humans | API payloads, logs, anything between services |
| YAML | Most human-readable | Implicit type coercion, indentation traps | Config when a tool requires it (k8s, GitHub Actions) |
| TOML | Human-readable + explicit types | Less nesting power, less popular | Hand-edited config (Cargo.toml, pyproject.toml) |
A second tier you'll occasionally meet: XML (slowly dying outside enterprise), INI (still alive in Windows-land), Protocol Buffers / Avro / MessagePack (binary; for performance-sensitive RPC), Parquet / ORC (columnar; for analytics).
CSV
The oldest and simplest. Every row is a record; every column is a field; values are separated by a delimiter (usually ,).
id,name,email,signup_date
1,Alice,alice@example.com,2026-01-15
2,Bob,bob@example.com,2026-02-03
Strengths. Tiny on disk. Streamable — you can read row-by-row without loading the whole file. Every analytics tool, spreadsheet, and database can ingest it. For tabular data, nothing beats it.
Weaknesses.
- The "is comma the separator" problem. Europe traditionally uses ; because , is the decimal separator. Tab-separated (TSV) is also common. Always specify the delimiter explicitly when reading CSV; never trust the default.
- Quoting and embedded newlines. A field containing a comma must be quoted: "Smith, John". A field containing a quote must double the quote: "He said ""hi""". A field containing a newline must be quoted, and now your "one row per line" assumption is wrong.
- No schema. Column names live in the first row by convention. Types are nowhere: every value is a string until your code parses it.
- No nesting. A column can be a comma-separated list, but at that point you're putting CSV inside CSV.
Don't write CSV by string concatenation. Use a library (csv-parse for Node.js, the csv module in Python's standard library, encoding/csv in Go); they handle quoting, escaping, and the encoding edge cases.
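Here is a minimal sketch of that advice using Python's standard csv module. The sample rows are made up for illustration; they pack a comma, a double quote, and a newline into fields so you can see the library doing the quoting for you.

```python
import csv
import io

# Illustrative rows (not real data): a comma, a quote, and a newline in fields.
rows = [
    ["id", "name", "note"],
    ["1", "Smith, John", 'He said "hi"'],
    ["2", "Bob", "line one\nline two"],
]

# Write with an explicit delimiter; the writer decides which fields need quotes.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL).writerows(rows)
text = buffer.getvalue()

# Read back with the same explicit delimiter.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=","))

# The round trip preserves every field, including the embedded newline.
round_trip_ok = parsed == rows

# Types are your job: every parsed value is still a string.
ids = [int(row[0]) for row in parsed[1:]]
```

Swapping the delimiter argument to ";" or "\t" in both calls is all it takes to handle the other common dialects.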
JSON
The lingua franca of the web. Two data types matter:
- Object: {"key": "value"}. Keys must be strings.
- Array: [1, 2, 3]. Order is preserved.
Plus four primitive types: string (double-quoted), number (no NaN or Infinity allowed), true/false, null.
Strengths. Strict: only one way to write things. Universal: every language has a parser and serializer in its standard library or a first-tier library. Easy to validate against a schema (JSON Schema, Zod). Compact when minified.
Weaknesses.
- No comments. This is the most-complained-about absence. The pragmatic workaround: a _comment field. The technical workaround: JSONC or JSON5 (extensions that allow comments), but they're not the JSON your JSON.parse accepts.
- Trailing commas are a syntax error. {"a": 1,} won't parse. JavaScript fixed this in source code; JSON did not.
- Numbers have no specified precision. 1.1 + 2.2 === 3.3000000000000003 in JS. JSON doesn't say what numeric type to use, so different parsers do different things with very large or very precise numbers. For monetary values, use strings.
- Verbose for human editing of large files. Quote-comma-quote-comma. YAML and TOML solve this.
The right tool when machines write to other machines.
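The strictness and the number pitfalls are easy to see with Python's standard json module. This is a sketch under assumptions: the payload fields (order_id, total) are invented for illustration, and the double-based parser is simulated by forcing integers through float.

```python
import json
from decimal import Decimal

# 1. A strict parser rejects trailing commas.
try:
    json.loads('{"a": 1,}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False

# 2. NaN is not valid JSON; allow_nan=False makes the serializer enforce that.
try:
    json.dumps({"x": float("nan")}, allow_nan=False)
    nan_ok = True
except ValueError:
    nan_ok = False

# 3. Money and IDs sent as strings survive any parser; convert at the edge.
payload = json.loads('{"order_id": "1234567890123456789", "total": "19.99"}')
total = Decimal(payload["total"])

# 4. What a double-based parser does to a 19-digit integer ID
#    (simulated here by parsing every integer as a float).
lossy = json.loads('{"id": 1234567890123456789}', parse_int=float)
id_survived = int(lossy["id"]) == 1234567890123456789
```

After this runs, trailing_comma_ok, nan_ok, and id_survived are all False, while total is an exact decimal.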
YAML
Designed to be human-readable. Significant indentation (like Python). Uses key: value instead of "key": "value".
name: app
retries: 3
flags:
  debug: true
servers:
  - alpha
  - beta
Strengths. Compact. Comments allowed. Good for hand-editing.
Weaknesses — and they are real.
- The quoting tarpit. version: 3.10 parses as the float 3.1; the trailing zero is gone. country: NO parses as the boolean false (an old YAML 1.1 thing, the "Norway problem"). time: 12:30 may parse as the integer 750 in some implementations, because it is read as a base-60 number (12 × 60 + 30). The fix is to quote anything that looks like a number, boolean, or date if you actually mean a string. The bigger fix is to use TOML.
- Indentation matters. Tabs are not allowed for indentation, and two vs four spaces must stay consistent within a block: every YAML hand-edit is a chance to break the file.
- Anchors and references. YAML has &anchor and *reference for re-using values. Powerful but rarely understood. Many teams avoid them by convention; note that a "safe load" mode typically restricts which object types can be constructed and does not by itself switch off anchors.
- Multiple spec versions. YAML 1.0, 1.1, 1.2 differ in subtle ways. Many parsers default to 1.1 quirks for backwards-compat.
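To make the quoting tarpit concrete, here is a toy model of YAML 1.1 implicit typing, written in plain Python so it runs without a YAML library. It is not a parser: the function name resolve_scalar and the small rule subset are assumptions for illustration, and a real implementation has many more rules.

```python
import re

# Simplified subset of the YAML 1.1 boolean spellings, matched case-insensitively.
_BOOLS = {
    "yes": True, "on": True, "true": True,
    "no": False, "off": False, "false": False,
}

def resolve_scalar(text: str):
    """Guess a type for one scalar the way a YAML 1.1 parser roughly would."""
    # Quoted scalars are always strings; this is the escape hatch.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    if text.lower() in _BOOLS:
        return _BOOLS[text.lower()]
    if re.fullmatch(r"[-+]?\d+", text):
        return int(text)
    if re.fullmatch(r"[-+]?\d+\.\d*", text):
        return float(text)
    # Sexagesimal (base-60) integers such as 12:30.
    if re.fullmatch(r"[-+]?[1-9]\d*(:[0-5]?\d)+", text):
        total = 0
        for part in text.lstrip("+-").split(":"):
            total = total * 60 + int(part)
        return -total if text.startswith("-") else total
    return text

version = resolve_scalar("3.10")            # float, trailing zero lost
country = resolve_scalar("NO")              # boolean, not a country code
clock = resolve_scalar("12:30")             # base-60 integer
quoted_version = resolve_scalar('"3.10"')   # stays a string
```

The point of the sketch is the last line: quoting is the only thing that stops the guessing.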
YAML is the right format when a tool you use requires it — Kubernetes, GitHub Actions, Docker Compose, Ansible. Don't pick YAML for a new file format you control.
TOML
Tom's Obvious, Minimal Language. Hand-editable config without YAML's footguns.
name = "app"
retries = 3
[flags]
debug = true
[[servers]]
name = "alpha"
[[servers]]
name = "beta"
Strengths.
- Explicit types. version = "3.10" is a string; version = 3.10 is a float. No coercion, no surprises.
- No indentation rules. Tables are explicit ([section]).
- Comments allowed: # to end of line.
- Strong for sectioned config: Cargo, pyproject, Gleam, and Hugo all use TOML.
Weaknesses.
- Awkward for deep nesting. Arrays of tables ([[servers]]) work but get verbose past two levels.
- Less ubiquitous than YAML/JSON. Pick a library; not every language has a stdlib parser yet (Python 3.11+ does).
For "config file the developer hand-edits", TOML is the modern best-practice choice.
Picking the format
A short decision tree:
- Is it tabular data? → CSV (or Parquet for analytics).
- Is it a payload between machines? → JSON.
- Is it config that humans will hand-edit? → TOML.
- Does the tool I'm using force YAML? → YAML.
- Is performance critical and the schema fixed? → Protobuf, Avro, MessagePack.
Notice what's NOT on this tree: "general-purpose data interchange that humans will edit AND machines will consume". That's the YAML zone, and YAML's quoting tarpit means you'll hate it eventually. If you have a free choice, prefer JSON for the machine side and a separate human-friendly view (markdown, TOML, a UI) for the human side.
Common bugs and how to spot them
- CSV with one column. You guessed , and the file uses ;. Specify the delimiter.
- Numbers that aren't quite right in JSON. Floating-point precision, or your parser silently coerced a 19-digit ID to a 64-bit float, which holds integers exactly only up to 2^53. Use strings for IDs and money.
- YAML config where a value is the wrong type. Implicit type coercion. Quote it.
- TOML file the parser refuses. Almost always a missing [section] header or a duplicate key; TOML is strict about both.
- CSV with embedded newlines breaking your line-by-line reader. Use a real CSV parser, not split("\n").
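The two CSV bugs in that list are easy to reproduce. A small sketch with Python's csv module follows; both sample files are invented for illustration.

```python
import csv
import io

# Bug 1: the file uses ";" but the reader assumes ",".
semicolon_file = "id;name;city\n1;Alice;Oslo\n2;Bob;Lyon\n"
wrong = list(csv.reader(io.StringIO(semicolon_file)))                 # default ","
right = list(csv.reader(io.StringIO(semicolon_file), delimiter=";"))

wrong_widths = {len(row) for row in wrong}   # every row is one fat column
right_widths = {len(row) for row in right}   # three columns, as intended

# Bug 2: a quoted field contains a newline, so lines and rows differ.
multiline_file = 'id,comment\n1,"first line\nsecond line"\n2,ok\n'
naive_rows = multiline_file.strip().split("\n")
real_rows = list(csv.reader(io.StringIO(multiline_file, newline="")))
```

The naive split sees four lines where the parser correctly sees three records, which is the symptom to look for when row counts don't match.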
Tooling — knowing the unix verbs
You'll work with these formats from the shell. The verbs:
- jq: query and transform JSON. jq '.users[] | select(.active)' < users.json.
- yq: same but for YAML/TOML/XML. yq '.spec.replicas' < deployment.yaml.
- miller: awk for tabular data. mlr --csv filter '$status == "active"'.
- csvkit: CSV-specific suite. csvsql --query "SELECT count(*) FROM stdin" < data.csv.
These are some of the highest-leverage tools you can have on your $PATH for working with data files.
What to internalise
- File formats are tools — pick the right one and the file fades away.
- CSV: tabular, flat, always specify the delimiter.
- JSON: inter-service, strict, no comments.
- YAML: only when forced — quote anything ambiguous.
- TOML: hand-edited config, explicit types, your friend.
Tools in the wild
- jq (CLI, free tier): The unix tool for slicing and querying JSON.
- yq (CLI, free tier): jq-style queries for YAML/TOML/XML.
- miller (CLI, free tier): Like awk for CSV/TSV/JSON-lines tabular data.
- csvkit (CLI, free tier): Suite for slicing, joining, and SQL-querying CSV files.
- VS Code YAML extension (service, free tier): Schema-aware YAML editing that catches type-coercion bugs at edit time.