unix · level 6

Text Processing

awk vs sed vs grep — picking the right tool for the line.

Most of "data engineering" at the shell is nine tools and a pipe character. Once you can chain them fluently, an enormous category of "should I write a script for this?" questions disappears.

Analogy

Think of the kitchen: you don't grind, slice, and zest with the same blade. Even a great chef's knife loses to a peeler when all you want is the skin off a carrot. grep is the spotlight that finds what you want. sed is the surgeon's scalpel that changes one thing on a moving line. awk is the cutting board that gives you fields and lets you do real work on them. Pick the wrong tool and you'll be carving a turkey with a paring knife.

The toolkit

Tool   Job
grep   Find lines that match a pattern.
sed    Edit lines in a stream: substitute, delete, insert.
awk    Treat lines as records of fields; arithmetic and aggregation.
cut    Slice fixed character or delimiter ranges out of each line.
sort   Reorder lines (lexically, numerically, by field, reverse, ...).
uniq   Collapse adjacent duplicate lines (often -c to count).
wc     Count lines, words, bytes.
tr     Single-character translate / squeeze / delete.
paste  Glue files together column-wise.

Master these nine and most ad-hoc text munging at the shell becomes one line.

The pattern that unlocks the toolkit

Almost every interesting one-liner is the same shape:

some-source | filter | transform | aggregate | order | take

The "top 10 IPs in an nginx log" classic:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 10
  • awk '{print $1}' — pull the first field (the IP).
  • sort — group identical lines together (uniq only sees adjacency).
  • uniq -c — collapse runs and prepend the count.
  • sort -rn — reorder by count, descending, numerically.
  • head -n 10 — top ten.

Internalise this shape and you'll write a dozen variations a week.
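As a sketch of the same shape with an explicit filter stage, the snippet below finds the path that returns status 500 most often. The log lines, their three-field layout (IP, path, status) and the variable names are invented for illustration; a real nginx log has more fields, so the field numbers would differ.

```shell
# Hypothetical, simplified log: IP, path, status (not the real nginx layout).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 /api/users 200
10.0.0.2 /api/orders 500
10.0.0.1 /api/orders 500
10.0.0.3 /api/users 500
10.0.0.2 /api/orders 500
10.0.0.4 /healthz 200
EOF

# source | filter | transform | aggregate | order | take
top_failing=$(
  grep ' 500$' "$log" |   # filter: keep only lines ending in status 500
    awk '{print $2}' |    # transform: project the path
    sort | uniq -c |      # aggregate: count each distinct path
    sort -rn |            # order: highest count first
    head -n 1             # take: the single worst offender
)

# Prints the count followed by the path; the padding before the count
# varies between uniq implementations.
printf '%s\n' "$top_failing"
rm -f "$log"
```

Swapping the grep pattern or the awk field number gives a different report without changing the rest of the pipeline.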

When to pick which tool

grep — when you only need "does this line match"

Pure pattern search. Fast, simple, the right answer when:

  • You want lines containing a word: grep ERROR app.log
  • You want lines that don't match: grep -v healthcheck access.log
  • You want surrounding context: grep -B2 -A5 'panic:' app.log
  • You're searching a big tree recursively: prefer rg (ripgrep), which is multithreaded, respects .gitignore (it skips ignored files by default), and is often several times faster.

If you find yourself piping grep | grep | grep, you probably want awk instead.
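A minimal sketch of that swap, assuming a made-up log format: three chained greps next to one awk rule that expresses the same conditions with && and !. The log lines and variable names are hypothetical.

```shell
# Hypothetical log lines, invented for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01 ERROR db timeout on orders
2024-05-01 INFO db connected
2024-05-02 ERROR db healthcheck failed
2024-05-02 ERROR cache timeout on sessions
EOF

# Chained form: three processes, one condition each.
chained=$(grep ERROR "$log" | grep db | grep -v healthcheck)

# Single awk: the same three conditions in one pattern.
# A pattern with no action prints the matching line.
single=$(awk '/ERROR/ && /db/ && !/healthcheck/' "$log")

printf '%s\n' "$single"
rm -f "$log"
```

Both forms select the same line here. The awk version also leaves room to test a specific field (for example $2 == "ERROR") instead of matching anywhere on the line.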

sed — when you want to edit lines as they pass by

sed is a streaming text editor. It shines for substitutions:

sed 's/localhost/db.internal/g' app.conf

…or for in-place edits across many files:

sed -i.bak 's/localhost/db.internal/g' *.conf

The -i.bak form makes a .bak backup of each file first. Always use it the first few times you sed in-place — it's the difference between "oh shit" and "oh well, let me restore the .bak".

You can also delete lines (sed '/healthcheck/d'), insert above a match (sed '/^Server:/i\X-Custom: yes', which is GNU sed's one-line form; POSIX and BSD sed expect a newline after the i\), and operate on line ranges. Once your sed expression has more than two commands, switch to awk.
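The delete and line-range forms can be sketched as follows. The config content and variable names are invented for illustration; the sed commands used here (d, s with an address range, and -n with p) should behave the same in GNU and BSD sed.

```shell
# Hypothetical config text, invented for illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# generated file
host=localhost
port=5432
# healthcheck=on
replica=localhost
EOF

# Delete every line matching a pattern (here: comment lines).
no_comments=$(sed '/^#/d' "$conf")

# Substitute only within lines 1-3, leaving line 5 untouched.
ranged=$(sed '1,3s/localhost/db.internal/' "$conf")

# Print only a range: -n suppresses default output, p prints lines 2-3.
slice=$(sed -n '2,3p' "$conf")

printf '%s\n' "$ranged"
rm -f "$conf"
```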

awk — when columns matter

This is the workhorse. awk automatically splits each line into fields ($1, $2, ...) and gives you a tiny scripting language to operate on them.

Print the 3rd field:

awk '{print $3}' file

Sum the 3rd column:

awk '{ s += $3 } END { print s }' usage.tsv

Filter to rows where field 5 is "ERROR" and project the timestamp + message:

awk '$5 == "ERROR" { print $1, $6 }' app.log

Use a different delimiter (simple CSV with no quoted or embedded commas):

awk -F, '{ print $2 }' people.csv

The mental model: awk runs each line through a series of pattern { action } rules, with BEGIN and END blocks for setup/teardown. It's a 5-minute-to-learn, lifetime-of-utility tool.
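To make the rule structure concrete, here is a small sketch with one BEGIN block, one pattern { action } rule, and one END block. The data, the threshold of 100, and the variable names are assumptions made up for the example.

```shell
# Hypothetical tab-separated usage data: user, requests.
data=$(mktemp)
printf 'alice\t120\nbob\t45\ncarol\t300\n' > "$data"

report=$(awk '
  BEGIN      { FS = "\t"; print "heavy users:" }   # runs once, before any input
  $2 >= 100  { print "  " $1; n++ }                # runs for each matching line
  END        { print n " of " NR " users" }        # runs once, after the last line
' "$data")

printf '%s\n' "$report"
rm -f "$data"
```

NR is awk's built-in count of records read so far, so in the END block it holds the total number of lines.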

cut — when you only need a slice

cut is the simplest tool — fixed character ranges or delimited fields, nothing else:

cut -c1-19 app.log         # first 19 chars (e.g. ISO timestamp)
cut -d: -f1 /etc/passwd     # everything before the first colon

It's barely worth using when awk is on the system (which is everywhere), but it's so cheap that the muscle memory is worth keeping.
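One behavioural difference is worth seeing once before choosing between the two: cut treats every single delimiter character as a field boundary, while awk's default splitting collapses runs of whitespace. A minimal sketch, using an invented sample line:

```shell
# A line with runs of spaces between columns, as aligned command output often has.
line='alice    42   admin'

# cut treats each space as a delimiter, so field 2 is an empty string.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk collapses runs of whitespace by default, so field 2 is the number.
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

printf 'cut: [%s]  awk: [%s]\n' "$with_cut" "$with_awk"
```

For single-character delimiters with no repeats, such as the colons in /etc/passwd, the two give the same result.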

sort + uniq is one tool

uniq only collapses adjacent duplicates. So sort | uniq is the canonical idiom — sort first to group, then uniq to dedup or count:

sort file             # group duplicates next to each other
sort -u file          # same as sort file | uniq
sort file | uniq -c   # group + count
... | sort -rn        # reorder by count, descending, numerically

Common flags worth memorising:

  • sort -k 3,3n — sort by the 3rd field, numerically.
  • sort -t, — split on commas (CSV).
  • sort -u — built-in uniq.
  • uniq -c — count.
  • uniq -d — only show duplicates.
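A short sketch of these flags on invented data (the CSV rows and variable names are hypothetical). It also shows why the n modifier matters: without it, the key is compared as text.

```shell
# Hypothetical CSV: name, team, score.
csv=$(mktemp)
printf '%s\n' 'dana,ops,9' 'eli,dev,100' 'fay,ops,25' > "$csv"

# -t, sets the field separator; -k3,3n sorts on field 3 only, numerically.
by_score=$(sort -t, -k3,3n "$csv")

# Without n the keys compare as text, so "100" sorts before "25" and "9".
by_text=$(sort -t, -k3,3 "$csv")

# uniq -d prints one copy of each line that appears more than once in a row,
# which is why the input is sorted first.
dupes=$(printf '%s\n' b a b c a b | sort | uniq -d)

printf '%s\n' "$by_score"
rm -f "$csv"
```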

Why awk wins when columns matter

The moment you're doing arithmetic, conditionals, or aggregations across rows, awk lets you write the entire program in one expression:

# Average of column 4 grouped by column 1
awk '{ s[$1] += $4; c[$1] += 1 }
     END { for (k in s) printf "%s %.2f\n", k, s[k]/c[k] }' data.tsv

That's a working group-by-and-average in three lines, no Python, no temp files, no libraries.
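Here is the same program run against a few rows of invented data, as a sketch. The column meanings (service, host, region, latency) are assumptions for the example. Because for (k in s) visits keys in an unspecified order, the output is piped through sort to make it stable.

```shell
# Hypothetical tab-separated data: service, host, region, latency_ms.
data=$(mktemp)
printf 'api\th1\teu\t100\nweb\th2\teu\t30\napi\th3\tus\t50\nweb\th4\tus\t10\n' > "$data"

# s[] accumulates the sum per key, c[] the row count per key.
averages=$(awk '{ s[$1] += $4; c[$1] += 1 }
     END { for (k in s) printf "%s %.2f\n", k, s[k]/c[k] }' "$data" | sort)

printf '%s\n' "$averages"
rm -f "$data"
```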

When to reach for Python instead

Three triggers tell you to switch:

  1. You need a real data structure. Nested dicts, sets-of-tuples, hash-by-tuple. POSIX awk has flat associative arrays only (gawk adds arrays of arrays as an extension).
  2. You're walking multiple files and need to remember state across them — joins, lookups, cross-references.
  3. The pipeline is starting to feel write-only. If you can't read your awk after lunch, your future self will thank you for switching.

A useful threshold: if the awk gets past three lines or a gsub plus a for, it's usually clearer in Python.

Worth knowing exists

  • tr — single-character translate. tr A-Z a-z lowercases. tr -d '\r' strips Windows CRs.
  • paste — glue files horizontally: paste -d, names.txt scores.txt.
  • fold — wrap long lines at a column.
  • xargs — convert stdin into command-line arguments. Pairs with find (next lesson).
  • column -t — pretty-print whitespace-delimited tables.
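A minimal sketch combining tr and paste from the list above. The file names match the paste example, but their contents are invented for illustration, and names.unix is a hypothetical intermediate file.

```shell
# Hypothetical input files, invented for illustration.
dir=$(mktemp -d)
printf 'Ada\r\nGrace\r\n' > "$dir/names.txt"   # Windows line endings
printf '91\n88\n' > "$dir/scores.txt"

# tr -d deletes characters: strip the carriage returns first.
tr -d '\r' < "$dir/names.txt" > "$dir/names.unix"

# paste joins line N of each file; -d, uses a comma as the separator.
joined=$(paste -d, "$dir/names.unix" "$dir/scores.txt")

# tr with two sets maps characters one-to-one: here, upper to lower case.
lowered=$(printf '%s\n' "$joined" | tr 'A-Z' 'a-z')

printf '%s\n' "$lowered"
rm -rf "$dir"
```

Stripping the carriage returns before paste matters: otherwise each name would carry a trailing CR into the middle of the joined line.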

A fluency checklist

You're fluent at this layer when you can, without thinking:

  • Find lines containing a phrase, with context: grep -B2 -A2 PHRASE
  • Replace text in many files with backups: sed -i.bak 's/X/Y/g' files...
  • Print just the second field: awk '{print $2}'
  • Get the 10 most common values: ... | sort | uniq -c | sort -rn | head
  • Sum a column: awk '{s+=$N} END {print s}'
  • Filter to a column-condition + project: awk '$5=="X" {print $1,$2}'

Get there and you'll feel the day-to-day pressure to write throwaway Python scripts disappear.

Tools in the wild

  • GNU awk (cli, free)

    The reference awk. `gawk` adds extensions, but the POSIX core is what every box has.

  • GNU sed (cli, free)

    Stream editor for substitutions and line ranges; `-i` for in-place edits.

  • ripgrep (rg) (cli, free)

    Multithreaded grep written in Rust. Close to a drop-in for everyday searches, though its flags and defaults differ from grep's; often much faster on big trees.

  • miller (mlr) (cli, free)

    awk for CSV/TSV/JSON, with verbs such as `mlr stats1`, `mlr put`, `mlr cat`.

  • choose (cli, free)

    Friendlier `cut`: `choose 0 2 4` picks fields 0/2/4 (fields are zero-indexed).