Text Processing
awk vs sed vs grep — picking the right tool for the line.
Most of "data engineering" at the shell is nine small tools and a pipe character. Once you can chain them fluently, an enormous category of "should I write a script for this?" questions disappears.
Analogy
Think of the kitchen: you don't grind, slice, and zest with the same blade. Even a great chef's knife loses to a peeler when all you want is the skin off a carrot. grep is the spotlight that finds what you want. sed is the surgeon's scalpel that changes one thing on a moving line. awk is the cutting board that gives you fields and lets you do real work on them. Pick the wrong tool and you'll be carving a turkey with a paring knife.
The toolkit
| Tool | Job |
|---|---|
| grep | Find lines that match a pattern. |
| sed | Edit lines in a stream: substitute, delete, insert. |
| awk | Treat lines as records of fields; arithmetic and aggregation. |
| cut | Slice fixed character or delimiter ranges out of each line. |
| sort | Reorder lines (lexically, numerically, by field, reverse, ...). |
| uniq | Collapse adjacent duplicate lines (often -c to count). |
| wc | Count lines, words, bytes. |
| tr | Single-character translate / squeeze / delete. |
| paste | Glue files together column-wise. |
Master these nine and ~80% of ad-hoc text munging at the shell becomes one line.
The pattern that unlocks the toolkit
Almost every interesting one-liner is the same shape:
some-source | filter | transform | aggregate | order | take
The "top 10 IPs in an nginx log" classic:
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 10
- `awk '{print $1}'`: pull the first field (the IP).
- `sort`: group identical lines together (uniq only sees adjacency).
- `uniq -c`: collapse runs and prepend the count.
- `sort -rn`: reorder by count, descending, numerically.
- `head -n 10`: top ten.
Internalise this shape and you'll write a dozen variations a week.
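To see every stage of that shape at once, here is a small sketch that adds an explicit filter step in front of the classic. The log lines and IP addresses are made up for illustration, and the temp file stands in for a real access.log.

```shell
# Hypothetical access log; only the first field (client IP) and the path matter.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - "GET / HTTP/1.1" 200
10.0.0.2 - - "GET /about HTTP/1.1" 200
10.0.0.1 - - "GET /health HTTP/1.1" 200
10.0.0.3 - - "POST /login HTTP/1.1" 500
10.0.0.1 - - "GET / HTTP/1.1" 200
10.0.0.2 - - "GET / HTTP/1.1" 200
10.0.0.1 - - "GET /docs HTTP/1.1" 200
EOF

# source | filter | transform | aggregate | order | take
#   filter:    drop health checks
#   transform: keep only the IP
#   aggregate: count each IP
#   order:     biggest count first
#   take:      top two
top=$(grep -v '/health' "$log" |
  awk '{print $1}' |
  sort | uniq -c |
  sort -rn |
  head -n 2)

printf '%s\n' "$top"
rm -f "$log"
```

With this sample, 10.0.0.1 comes out on top with a count of 3, because its health check was filtered out before counting.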
When to pick which tool
grep — when you only need "does this line match"
Pure pattern search. Fast, simple, the right answer when:
- You want lines containing a word: `grep ERROR app.log`
- You want lines that don't match: `grep -v healthcheck access.log`
- You want surrounding context: `grep -B2 -A5 'panic:' app.log`
- You're searching a big tree recursively: prefer `rg` (ripgrep), which is multithreaded, skips files listed in .gitignore, and is typically several times faster on large trees.
If you find yourself piping grep | grep | grep, you probably want awk instead.
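As a rough illustration of that last point, the sketch below runs a three-grep chain and a single awk rule over the same made-up log. The log format (timestamp, level, component, message) is an assumption for the example, not something a real app.log is guaranteed to follow.

```shell
# Hypothetical application log, invented for this example.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01T10:00:00 INFO api request ok
2024-05-01T10:00:01 ERROR api timeout upstream
2024-05-01T10:00:02 ERROR worker healthcheck failed
2024-05-01T10:00:03 WARN api slow response
2024-05-01T10:00:04 ERROR api connection refused
EOF

# Chained greps: ERROR lines from the api component that do not mention timeout.
chained=$(grep ERROR "$log" | grep ' api ' | grep -v timeout)

# The same three conditions as one awk pattern: two field tests plus a negated regex.
single=$(awk '$2 == "ERROR" && $3 == "api" && !/timeout/' "$log")

printf 'grep chain: %s\n' "$chained"
printf 'awk rule:   %s\n' "$single"
rm -f "$log"
```

Both forms select the same line here. The awk version also says which field each condition applies to, which the grep chain can only approximate with padding spaces.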
sed — when you want to edit lines as they pass by
sed is a streaming text editor. It shines for substitutions:
sed 's/localhost/db.internal/g' app.conf
…or for in-place edits across many files:
sed -i.bak 's/localhost/db.internal/g' *.conf
The -i.bak form makes a .bak backup of each file first. Always use it the first few times you sed in-place — it's the difference between "oh shit" and "oh well, let me restore the .bak".
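You can also delete lines (sed '/healthcheck/d'), insert above a match (sed '/^Server:/i\X-Custom: yes', which is GNU sed's one-line form; BSD sed wants the inserted text on its own line after i\), and operate on line ranges (sed '10,20d'). Once your sed expression has more than two commands, switch to awk.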
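Here is a minimal sketch of the backup, delete, and line-range forms, run against a throwaway file in a scratch directory. The file contents and the `primary` hostname are made up; the point is only to show what each command leaves behind.

```shell
# Work in a scratch directory so no real config is touched.
dir=$(mktemp -d)
printf '%s\n' 'host=localhost' 'port=5432' '# healthcheck every 5s' 'replica=localhost' > "$dir/app.conf"

# Substitute in place; app.conf.bak keeps the pre-edit contents.
sed -i.bak 's/localhost/db.internal/g' "$dir/app.conf"
edited=$(cat "$dir/app.conf")
backup=$(cat "$dir/app.conf.bak")

# Delete matching lines. Without -i the file is untouched and the result goes to stdout.
no_health=$(sed '/healthcheck/d' "$dir/app.conf")

# Limit a substitution to a line range (lines 1 through 2 here).
first_two=$(sed '1,2s/db\.internal/primary/' "$dir/app.conf")

printf '%s\n' "$first_two"
rm -rf "$dir"
```

Only the host line changes in the last command: the replica line also matches the pattern, but it sits outside the 1,2 range.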
awk — when columns matter
This is the workhorse. awk automatically splits each line into fields ($1, $2, ...) and gives you a tiny scripting language to operate on them.
Print the 3rd field:
awk '{print $3}' file
Sum the 3rd column:
awk '{ s += $3 } END { print s }' usage.tsv
Filter to rows where field 5 is "ERROR" and project the timestamp + message:
awk '$5 == "ERROR" { print $1, $6 }' app.log
Use a different delimiter (simple CSV with no quoted commas):
awk -F, '{ print $2 }' people.csv
The mental model: awk runs each line through a series of pattern { action } rules, with BEGIN and END blocks for setup/teardown. It's a 5-minute-to-learn, lifetime-of-utility tool.
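The pattern { action } model is easier to see with all three kinds of rule in one program. This sketch assumes a made-up tab-separated file of user, region, and megabytes; the column layout is hypothetical.

```shell
# Hypothetical tab-separated usage data: user, region, megabytes.
data=$(mktemp)
printf 'alice\teu\t120\nbob\tus\t80\ncarol\teu\t200\n' > "$data"

report=$(awk -F'\t' '
  BEGIN      { print "user mb" }            # runs once, before any input
  $2 == "eu" { print $1, $3; total += $3 }  # runs for each line where the pattern holds
  END        { print "total", total }       # runs once, after the last line
' "$data")

printf '%s\n' "$report"
rm -f "$data"
```

The bob row never reaches the middle action because its region is not "eu", so the total covers only alice and carol: 320.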
cut — when you only need a slice
cut is the simplest tool — fixed character ranges or delimited fields, nothing else:
cut -c1-19 app.log # first 19 chars (e.g. ISO timestamp)
cut -d: -f1 /etc/passwd # everything before the first colon
It's barely worth using when awk is on the system (which is everywhere), but it's so cheap that the muscle memory is worth keeping.
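One practical difference is worth seeing once: cut treats every single delimiter character as a boundary, while awk's default splitting collapses runs of whitespace. The sample line below is made up to look like column-aligned command output.

```shell
# A line with runs of spaces, as in column-aligned output (invented sample).
line='alice    42   admin'

# cut sees each space as a delimiter, so field 2 is the empty string
# between the first and second space.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk splits on runs of whitespace by default, so field 2 is the number.
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

# cut is a good fit when the delimiter is a single, consistent character.
user=$(printf 'root:x:0:0:root:/root:/bin/sh\n' | cut -d: -f1)

printf 'cut: [%s]  awk: [%s]  user: [%s]\n' "$with_cut" "$with_awk" "$user"
```

So cut is the lighter choice for colon- or tab-delimited data, and awk is the safer one for anything padded with spaces.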
sort + uniq is one tool
uniq only collapses adjacent duplicates. So sort | uniq is the canonical idiom — sort first to group, then uniq to dedup or count:
sort file # group duplicates next to each other
sort -u file # sort and drop duplicates, same as sort file | uniq
sort file | uniq -c # group + count
sort file | uniq -c | sort -rn # then order by count, descending, numerically
Common flags worth memorising:
- `sort -k 3,3n`: sort by the 3rd field, numerically.
- `sort -t,`: split on commas (CSV).
- `sort -u`: built-in uniq.
- `uniq -c`: count.
- `uniq -d`: only show duplicates.
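A short sketch of those flags working together, assuming a made-up three-column CSV of name, team, and score. It also shows why the n in the key matters.

```shell
# Hypothetical CSV: name, team, score.
csv=$(mktemp)
printf '%s\n' 'dana,red,9' 'eli,blue,100' 'fay,red,25' 'gus,blue,100' > "$csv"

# -t, sets the field separator; -k 3,3n sorts on field 3 only, as a number.
by_score=$(sort -t, -k 3,3n "$csv")

# Without the n, field 3 sorts as text, so 100 comes before 25 and 9 comes last.
by_text=$(sort -t, -k 3,3 "$csv")

# uniq -d prints one copy of each value that appears more than once.
repeated=$(cut -d, -f3 "$csv" | sort | uniq -d)

printf 'numeric:\n%s\n' "$by_score"
printf 'text:\n%s\n' "$by_text"
printf 'repeated score: %s\n' "$repeated"
rm -f "$csv"
```

The numeric sort puts dana's 9 first; the text sort puts it last, which is the usual sign that a numeric flag is missing.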
Why awk wins when columns matter
The moment you're doing arithmetic, conditionals, or aggregations across rows, awk lets you write the entire program in one expression:
# Average of column 4 grouped by column 1
awk '{ s[$1] += $4; c[$1] += 1 }
END { for (k in s) printf "%s %.2f\n", k, s[k]/c[k] }' data.tsv
That's a working group-by-and-average in three lines, no Python, no temp files, no libraries.
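To check the group-by against numbers you can verify by hand, here is the same program run on a tiny made-up input. The four-column layout (service, date, operation, milliseconds) is an assumption; only columns 1 and 4 are used.

```shell
# Hypothetical input: column 1 is the group key, column 4 is the value to average.
data=$(mktemp)
cat > "$data" <<'EOF'
web 2024-05-01 GET 120
web 2024-05-01 GET 80
db 2024-05-01 SELECT 30
db 2024-05-01 SELECT 50
db 2024-05-01 UPDATE 40
EOF

# for (k in s) visits keys in an unspecified order, so sort the result for stable output.
averages=$(awk '{ s[$1] += $4; c[$1] += 1 }
  END { for (k in s) printf "%s %.2f\n", k, s[k]/c[k] }' "$data" | sort)

printf '%s\n' "$averages"
rm -f "$data"
```

By hand: web is (120 + 80) / 2 = 100.00 and db is (30 + 50 + 40) / 3 = 40.00, which is what the pipeline prints.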
When to reach for Python instead
Three triggers tell you to switch:
- You need a real data structure. Nested dicts, sets-of-tuples, hash-by-tuple. POSIX awk has flat associative arrays only.
- You're walking multiple files and need to remember state across them — joins, lookups, cross-references.
- The pipeline is starting to feel write-only. If you can't read your awk after lunch, your future self will thank you for switching.
A useful threshold: if the awk gets past three lines or a gsub plus a for, it's usually clearer in Python.
Worth knowing exists
- `tr`: single-character translate. `tr A-Z a-z` lowercases. `tr -d '\r'` strips Windows CRs.
- `paste`: glue files horizontally: `paste -d, names.txt scores.txt`.
- `fold`: wrap long lines at a column.
- `xargs`: convert stdin into command-line arguments. Pairs with `find` (next lesson).
- `column -t`: pretty-print whitespace-delimited tables.
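A quick tour of four of these on throwaway data. The names, scores, and file names are made up, and column is left out only because it is not installed everywhere.

```shell
dir=$(mktemp -d)
printf 'Alice\r\nBob\r\n' > "$dir/names.txt"   # made-up names with Windows line endings
printf '90\n85\n' > "$dir/scores.txt"          # made-up scores

# tr: delete carriage returns, then map upper case to lower case.
tr -d '\r' < "$dir/names.txt" | tr 'A-Z' 'a-z' > "$dir/clean.txt"

# paste: glue the files side by side, with a comma between columns.
joined=$(paste -d, "$dir/clean.txt" "$dir/scores.txt")

# fold: wrap a long line at 4 characters.
wrapped=$(printf 'abcdefghij\n' | fold -w 4)

# xargs: turn lines on stdin into arguments for one echo call.
args=$(printf '%s\n' one two three | xargs echo)

printf '%s\n' "$joined" "$wrapped" "$args"
rm -rf "$dir"
```

Stripping the carriage returns first matters: without that step, paste would put the comma after an invisible \r on every line.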
A fluency checklist
You're fluent at this layer when you can, without thinking:
- Find lines containing a phrase, with context: `grep -B2 -A2 PHRASE`
- Replace text in many files with backups: `sed -i.bak 's/X/Y/g' files...`
- Print just the second field: `awk '{print $2}'`
- Get the 10 most common values: `... | sort | uniq -c | sort -rn | head`
- Sum a column: `awk '{s+=$N} END {print s}'`
- Filter to a column-condition + project: `awk '$5=="X" {print $1,$2}'`
Get there and you'll feel the day-to-day pressure to write throwaway Python scripts disappear.
Tools in the wild
- GNU awk (CLI, free): The reference awk. `gawk` adds extensions, but the POSIX core is what every box has.
- GNU sed (CLI, free): Stream editor for substitutions and line ranges; `-i` for in-place edits.
- ripgrep (rg) (CLI, free): Multithreaded grep written in Rust. Close to a drop-in replacement for recursive grep, and often much faster on big trees.
- miller (mlr) (CLI, free): awk for CSV/TSV/JSON, with verbs such as `mlr stats1`, `mlr put`, `mlr cat`.
- choose (CLI, free): Friendlier `cut`. `choose 0 2 4` picks fields 0/2/4.