UTF-8 & Unicode Normalization
Variable-byte encoding, leading vs continuation bytes, and why string length is a lie.
ASCII covers 128 characters. Unicode has room for over a million codepoints (U+0000 to U+10FFFF). UTF-8 is the variable-byte encoding that bridges them: every ASCII character still fits in one byte, but every other writing system on Earth gets a place in the byte stream too. Understand UTF-8 and you understand why "string length" is one of the most ambiguous questions in software.
Variable-byte, prefix-coded
UTF-8 encodes a Unicode codepoint into between 1 and 4 bytes, depending on its value. The high bits of the leading byte tell the decoder how many bytes the codepoint takes:
| Codepoint range | Bytes | Leading byte | Continuation bytes |
|---|---|---|---|
| U+0000–U+007F (ASCII) | 1 | 0xxxxxxx | none |
| U+0080–U+07FF | 2 | 110xxxxx | 10xxxxxx |
| U+0800–U+FFFF | 3 | 1110xxxx | 10xxxxxx × 2 |
| U+10000–U+10FFFF | 4 | 11110xxx | 10xxxxxx × 3 |
Two properties fall out of this design:
- Self-synchronising. Every continuation byte starts with `10`, every leading byte does not. If you land in the middle of a stream, walk forward (at most three bytes) until you find one that doesn't start with `10`; that's the next leading byte.
- ASCII-compatible. Bytes `0x00–0x7F` mean exactly the same thing in ASCII and UTF-8. Decades-old C code that scanned byte-by-byte for `\n`, `/`, `:` keeps working unchanged.
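The self-synchronising property can be sketched in a few lines. This is a minimal illustration that assumes well-formed input; `isContinuationByte` and `nextBoundary` are hypothetical helper names, not part of any standard API.

```javascript
// True when the byte has the continuation pattern 10xxxxxx.
function isContinuationByte(byte) {
  return (byte & 0b11000000) === 0b10000000;
}

// Hypothetical helper: starting at an arbitrary offset, skip continuation
// bytes until the next leading byte (or the end of the buffer).
function nextBoundary(bytes, offset) {
  let i = offset;
  while (i < bytes.length && isContinuationByte(bytes[i])) {
    i++;
  }
  return i;
}

// "a漢b" encodes as 61 | e6 bc a2 | 62.
const sample = new TextEncoder().encode("a\u6F22b");
// Offset 2 lands in the middle of 漢; the next leading byte is at offset 4.
const resumed = nextBoundary(sample, 2);
```

Because the check only looks at the top two bits, the same loop works without knowing anything about what came before the offset.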
"é" = U+00E9
1100 0011 | 1010 1001 (2 bytes, leading 110, continuation 10)
"漢" = U+6F22
1110 0110 | 1011 1100 | 1010 0010 (3 bytes)
"😀" = U+1F600
1111 0000 | 1001 1111 | 1001 1000 | 1000 0000 (4 bytes)
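The bit layouts above can be reproduced by hand. The following is a sketch of an encoder for a single codepoint, written for illustration only (real code should use `TextEncoder`); `encodeCodepoint` and `toHex` are hypothetical names.

```javascript
// Encode one Unicode codepoint into UTF-8 bytes, following the table above.
function encodeCodepoint(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7f) return [cp]; // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Format a byte array as space-separated lowercase hex.
const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

const eAcute = toHex(encodeCodepoint(0x00e9)); // "c3 a9"
const han = toHex(encodeCodepoint(0x6f22)); // "e6 bc a2"
const grin = toHex(encodeCodepoint(0x1f600)); // "f0 9f 98 80"
```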
The BOM is mostly a Windows artefact
UTF-8 doesn't need a byte-order mark: it has no byte order to mark. But Windows tooling occasionally writes a BOM (EF BB BF) at the start of UTF-8 files anyway. Some POSIX tools choke on it; many modern parsers strip it silently. If a file mysteriously fails to parse as JSON, run `file` or `xxd` on it and check whether it starts with EF BB BF.
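When the bytes are already in memory, the same check is easy to do in code. A minimal sketch, assuming the input is a `Uint8Array`; `hasUtf8Bom` and `stripUtf8Bom` are hypothetical helper names.

```javascript
// True when the buffer starts with the UTF-8 BOM bytes EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
}

// Return a view of the buffer without the leading BOM, if there is one.
function stripUtf8Bom(bytes) {
  return hasUtf8Bom(bytes) ? bytes.subarray(3) : bytes;
}

// BOM followed by the JSON text "{}".
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x7b, 0x7d]);

// ignoreBOM: true asks TextDecoder to leave any BOM in the output,
// so a clean result here comes from the manual strip, not the decoder.
const decoder = new TextDecoder("utf-8", { ignoreBOM: true });
const jsonText = decoder.decode(stripUtf8Bom(withBom));
const parsed = JSON.parse(jsonText);
```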
What "string length" actually means
Take the string café and ask "how long is it?" There are at least three honest answers:
- Byte length in UTF-8: `5` (the é is two bytes).
- UTF-16 code-unit length (what JavaScript's `s.length` returns): `4`.
- Grapheme-cluster length (what a user would call "characters"): `4`.
For the family-of-four emoji 👨‍👩‍👧‍👦 (four person emoji joined by three zero-width joiners, U+200D) the answers diverge even further:
| Question | Answer |
|---|---|
| UTF-8 byte length | 25 |
| UTF-16 code units (JS `s.length`) | 11 |
| Codepoints | 7 |
| Grapheme clusters (what the user sees) | 1 |
If you charge by characters in a text-message API and you use s.length, you charge eleven times for one emoji.
const s = "café";
new TextEncoder().encode(s).length; // 5 (UTF-8 bytes)
s.length; // 4 (UTF-16 code units)
[...new Intl.Segmenter().segment(s)].length; // 4 (grapheme clusters)
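The same four measurements can be taken for the family emoji. This sketch assumes a runtime with `Intl.Segmenter` (current browsers and recent Node.js releases); `measure` is a hypothetical helper, and the emoji is spelled with escapes so the zero-width joiners survive copy and paste.

```javascript
// Report the four common "lengths" of a string.
function measure(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf8Bytes: new TextEncoder().encode(s).length,
    utf16Units: s.length,
    codepoints: [...s].length, // string iteration walks codepoints
    graphemes: [...segmenter.segment(s)].length,
  };
}

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const counts = measure(family);
// { utf8Bytes: 25, utf16Units: 11, codepoints: 7, graphemes: 1 }
```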
Normalization — the same string, different bytes
Unicode often offers more than one way to spell the same visible character. The letter é can be:
- A single precomposed codepoint, `U+00E9`.
- A two-codepoint sequence: base `e` (`U+0065`) + combining acute accent (`U+0301`).
Both render identically. Both should compare equal in a sane username system. They don't, by default.
The fix is normalization, defined by Unicode in four forms:
| Form | Composition | Compatibility folds |
|---|---|---|
| NFC | Composed | No |
| NFD | Decomposed | No |
| NFKC | Composed | Yes |
| NFKD | Decomposed | Yes |
NFC is what the web has standardised on. Every URL, every JSON document, every database identifier should be NFC. The compatibility forms (NFKC/NFKD) also fold things like the half-width ﾊ (U+FF8A) to the full-width ハ (U+30CF) and the parenthesised ㈱ (U+3231) to the three characters (株). That is useful for search, but the folding is lossy, so it is dangerous when you need to preserve user-typed text verbatim.
const a = "caf\u00E9"; // precomposed: U+0063 U+0061 U+0066 U+00E9
const b = "cafe\u0301"; // decomposed: U+0063 U+0061 U+0066 U+0065 U+0301
a === b; // false: different code-unit sequences
a.normalize("NFC") === b.normalize("NFC"); // true
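The compatibility forms behave differently from NFC and NFD, and a short comparison makes the difference concrete. This sketch uses only the standard `String.prototype.normalize`; the variable names are illustrative.

```javascript
const halfWidthHa = "\uFF8A"; // ﾊ HALFWIDTH KATAKANA LETTER HA
const fullWidthHa = "\u30CF"; // ハ KATAKANA LETTER HA
const parenthesisedStock = "\u3231"; // ㈱ PARENTHESIZED IDEOGRAPH STOCK

// Canonical forms leave compatibility characters alone.
const nfcKeepsHalfWidth = halfWidthHa.normalize("NFC") === halfWidthHa; // true

// Compatibility forms fold them to their plain equivalents.
const nfkcFoldsHalfWidth = halfWidthHa.normalize("NFKC") === fullWidthHa; // true
const foldedStock = parenthesisedStock.normalize("NFKC"); // "(株)", three codepoints

// NFD splits é into e + U+0301, so the code-unit count grows from 4 to 5.
const decomposedLength = "caf\u00E9".normalize("NFD").length; // 5
```

Note that the NFKC results cannot be turned back into the original characters, which is why these forms suit search indexes better than stored user text.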
Where this bites in production
- Authentication: a user signs up as josé (precomposed é, U+00E9) and later logs in as josé (decomposed, e + U+0301). The lookup fails and they never log in again. Always normalize before comparing usernames or hashing passwords.
- Search: without normalization, a search for naïve (precomposed ï, U+00EF) misses documents that wrote it as i followed by a combining diaeresis (U+0308).
- Filename collisions: macOS HFS+ stores filenames in a near-NFD form; APFS and Linux store them in whatever bytes you pass. A folder copied between them can end up with two distinct files whose names look identical.
- Hash digests: `sha256("café")` is one digest if the input is NFC and a different digest if it's NFD. Sign the bytes you mean to sign.
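The authentication and digest cases above come down to the same thing: two spellings produce different bytes unless they are normalized first. A minimal sketch, assuming an in-memory `Map` stands in for the user table; `canonicalUsername`, `signUp`, `findAccount`, and `utf8Hex` are hypothetical names.

```javascript
// Hypothetical in-memory account store keyed by the NFC form of the username.
const accounts = new Map();

function canonicalUsername(raw) {
  return raw.normalize("NFC");
}

function signUp(username) {
  accounts.set(canonicalUsername(username), { displayName: username });
}

function findAccount(username) {
  return accounts.get(canonicalUsername(username));
}

// UTF-8 bytes as hex: what a hash function would actually see.
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s), (b) => b.toString(16).padStart(2, "0")).join(" ");
}

signUp("jos\u00E9"); // registered with precomposed é
const found = findAccount("jose\u0301"); // login typed with e + combining accent

const nfcBytes = utf8Hex("caf\u00E9"); // "63 61 66 c3 a9"
const nfdBytes = utf8Hex("cafe\u0301"); // "63 61 66 65 cc 81"
```

Since the two byte sequences differ, any digest computed over them differs as well, so normalization has to happen before hashing or signing.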
Rules of thumb
- Always store and transmit UTF-8. Modern protocols (HTTP, JSON, gRPC) assume it. UTF-16 is for Windows internals and Java strings; UTF-32 is rarely worth the space.
- Normalize on the boundary. When data arrives from the user, normalize to NFC. When it leaves, leave it alone.
- Don't trust `.length`. If you need to count "characters" the way a user does, segment by grapheme cluster: `Intl.Segmenter` in JS, a regex with `\X` in PCRE, the third-party `uniseg` package (`github.com/rivo/uniseg`) in Go.
- Pick the right boundary unit for the job. Bytes for storage and network, code units for legacy APIs, codepoints for a programmer's view, graphemes for a user's view.
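Truncation is where picking the wrong unit is most visible. The sketch below cuts a string at a grapheme boundary, assuming `Intl.Segmenter` is available; `truncateGraphemes` is a hypothetical helper.

```javascript
// Keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(s, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(s)) {
    if (count === maxGraphemes) break;
    out += segment;
    count++;
  }
  return out;
}

const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const message = familyEmoji + "ab";

// Grapheme-aware: keeps the whole family plus "a".
const byGrapheme = truncateGraphemes(message, 2);

// Code-unit slicing: keeps only the first person of the family.
const byCodeUnit = message.slice(0, 2);
```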
Tools in the wild
- iconv (CLI): Convert between encodings (UTF-8 ↔ UTF-16 ↔ Latin-1) on the command line.
- ICU (library): The reference Unicode library, used under the hood by many language runtimes and browsers.
- Intl.Segmenter (library, browser): Browser-native grapheme/word/sentence segmentation. The right way to count emoji.
- unicodedata (library, Python stdlib): Normalize, look up, and inspect Unicode codepoints from Python.