UTF-8 & Unicode Normalization

Variable-byte encoding, leading vs continuation bytes, and why string length is a lie.

ASCII covers 128 characters. Unicode has room for over a million codepoints (1,114,112, to be exact). UTF-8 is the variable-byte encoding that bridges them: every English character still fits in one byte, but every other writing system on Earth gets a place in the byte stream too. Understand UTF-8 and you understand why "string length" is one of the most ambiguous questions in software.

Variable-byte, prefix-coded

UTF-8 encodes a Unicode codepoint into between 1 and 4 bytes, depending on its value. The high bits of the leading byte tell the decoder how many bytes the codepoint takes:

Codepoint range          Bytes  Leading byte  Continuation bytes
U+0000–U+007F (ASCII)    1      0xxxxxxx      none
U+0080–U+07FF            2      110xxxxx      10xxxxxx
U+0800–U+FFFF            3      1110xxxx      10xxxxxx × 2
U+10000–U+10FFFF         4      11110xxx      10xxxxxx × 3

Two properties fall out of this design:

  • Self-synchronising. Every continuation byte starts with 10, and no leading byte does. If you land in the middle of a stream, skip forward (at most three bytes) until you find a byte that doesn't start with 10: that's the next leading byte.
  • ASCII-compatible. Bytes 0x00–0x7F mean exactly the same thing in ASCII and UTF-8. Decades-old C code that scanned byte-by-byte for \n, /, : keeps working unchanged.

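The self-synchronising property can be sketched in a few lines. This is a minimal illustration, not a full decoder: isContinuation and nextBoundary are hypothetical helper names, and the input is assumed to be well-formed UTF-8.

```javascript
// Continuation bytes have the bit pattern 10xxxxxx.
function isContinuation(byte) {
  return (byte & 0b11000000) === 0b10000000;
}

// Hypothetical helper: index of the first non-continuation byte at or after `start`.
function nextBoundary(bytes, start) {
  let i = start;
  while (i < bytes.length && isContinuation(bytes[i])) i++;
  return i;
}

const bytes = new TextEncoder().encode("a\u6F22b"); // 61 E6 BC A2 62 ("a漢b")
const boundary = nextBoundary(bytes, 2);             // start on 0xBC, inside 漢
const rest = new TextDecoder().decode(bytes.subarray(boundary));
console.log(boundary, rest); // 4 "b"
```

Starting on the second byte of 漢, the scan skips two continuation bytes and lands on the b, so decoding can resume without knowing anything about the bytes before it.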
"é"  = U+00E9
       1100 0011 | 1010 1001     (2 bytes, leading 110, continuation 10)

"漢" = U+6F22
       1110 0110 | 1011 1100 | 1010 0010   (3 bytes)

"😀" = U+1F600
       1111 0000 | 1001 1111 | 1001 1000 | 1000 0000   (4 bytes)
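The bit patterns above can be reproduced with a hand-rolled encoder. This is a sketch for illustration only (in practice TextEncoder does this for you); encodeCodepoint and hex are hypothetical helper names.

```javascript
// Hypothetical helper: encode one Unicode codepoint into its UTF-8 bytes.
function encodeCodepoint(cp) {
  // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7F) return [cp];                                // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];            // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                                // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];   // 10xxxxxx × 2
  }
  return [0xF0 | (cp >> 18),                                  // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),                         // 10xxxxxx × 3
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeCodepoint(0x00E9)));  // C3 A9
console.log(hex(encodeCodepoint(0x6F22)));  // E6 BC A2
console.log(hex(encodeCodepoint(0x1F600))); // F0 9F 98 80
```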

The BOM is mostly a Windows artefact

UTF-8 doesn't need a byte-order mark, because it has no byte order to mark. But Windows tooling occasionally writes a BOM (EF BB BF) at the start of UTF-8 files anyway. Some POSIX tools choke on it; many modern parsers strip it silently, but not all of them do. If a file mysteriously fails to parse as JSON, run file or xxd on it and check for a leading EF BB BF.
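If you control the reading side, one option is to drop the BOM yourself before parsing. A minimal sketch, assuming the file contents are already in a Uint8Array; stripBom is a hypothetical helper name.

```javascript
// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) if present.
function stripBom(bytes) {
  const hasBom =
    bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const file = new Uint8Array([0xEF, 0xBB, 0xBF, 0x7B, 0x7D]); // BOM followed by "{}"

// ignoreBOM: true stops TextDecoder from removing the BOM on its own,
// so the result here reflects only what stripBom did.
const decoder = new TextDecoder("utf-8", { ignoreBOM: true });
const text = decoder.decode(stripBom(file));
const parsed = JSON.parse(text); // parses as an empty object
```

By default (ignoreBOM: false) TextDecoder removes a leading BOM itself, which is often all you need.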

What "string length" actually means

Take the string café and ask "how long is it?" There are at least three honest answers:

  1. Byte length in UTF-8: 5 (the é is two bytes).
  2. UTF-16 code-unit length (what JavaScript's s.length returns): 4.
  3. Grapheme-cluster length (what a user would call "characters"): 4.

For the family-of-four emoji 👨‍👩‍👧‍👦 the answers are even worse:

Question                                  Answer
UTF-8 byte length                         25
UTF-16 code units (JS s.length)           11
Codepoints                                7
Grapheme clusters (what the user sees)    1

If you charge by characters in a text-message API and you use s.length, you charge eleven times for one emoji.

const s = "caf\u00E9";                        // "café" with a precomposed é
new TextEncoder().encode(s).length;           // 5 (UTF-8 bytes)
s.length;                                     // 4 (UTF-16 code units)
[...new Intl.Segmenter().segment(s)].length;  // 4 (grapheme clusters)
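The same four measurements can be taken for the family emoji from the table. This sketch spells the emoji with escapes so the source file's own encoding cannot interfere; it assumes a runtime with Intl.Segmenter and reasonably current Unicode data (recent browsers, recent Node.js).

```javascript
// Man, woman, girl, boy, joined by zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const utf8Bytes = new TextEncoder().encode(family).length; // 4×4 + 3×3 = 25
const codeUnits = family.length;                           // 4×2 + 3×1 = 11
const codepoints = [...family].length;                     // string iteration walks codepoints: 7
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)].length;   // 1

console.log({ utf8Bytes, codeUnits, codepoints, graphemes });
```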

Normalization — the same string, different bytes

Unicode often offers more than one way to spell the same visible character. The letter é can be:

  • A single precomposed codepoint, U+00E9.
  • A two-codepoint sequence: base e (U+0065) + combining acute accent (U+0301).

Both render identically. Both should compare equal in a sane username system. They don't, by default.

The fix is normalization, defined by Unicode in four forms:

Form   Composition   Compatibility folds
NFC    Composed      No
NFD    Decomposed    No
NFKC   Composed      Yes
NFKD   Decomposed    Yes

NFC is what the web has standardised on, so URLs, JSON documents, and database identifiers should all be NFC. The compatibility forms (NFKC/NFKD) additionally fold things like the half-width ﾊ to the full-width ハ and the parenthesised ㈱ to (株). That is useful for search, but it is lossy, so it is dangerous anywhere you need to preserve user-typed text verbatim.

const a = "caf\u00E9";   // U+0063 U+0061 U+0066 U+00E9 (precomposed é)
const b = "cafe\u0301";  // U+0063 U+0061 U+0066 U+0065 U+0301 (e + combining acute)
a === b;                                    // false: different code unit sequences
a.normalize("NFC") === b.normalize("NFC");  // true
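The compatibility folds can be checked the same way. A small sketch using the half-width and parenthesised examples mentioned above; the variable names are illustrative only.

```javascript
const halfWidthHa = "\uFF8A"; // ﾊ HALFWIDTH KATAKANA LETTER HA
const stockMark = "\u3231";   // ㈱ PARENTHESIZED IDEOGRAPH STOCK

// NFC leaves compatibility characters alone...
const nfcKeepsHalfWidth = halfWidthHa.normalize("NFC") === halfWidthHa;  // true
// ...while NFKC folds them to their plain equivalents.
const nfkcFoldsHa = halfWidthHa.normalize("NFKC") === "\u30CF";          // true (ハ)
const nfkcFoldsStock = stockMark.normalize("NFKC") === "(\u682A)";       // true ((株), three codepoints)

console.log(nfcKeepsHalfWidth, nfkcFoldsHa, nfkcFoldsStock);
```

Once folded, there is no way to recover which form the user originally typed, which is why NFKC belongs in a search index rather than in the stored text.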

Where this bites in production

  • Authentication. A user signs up as josé (precomposed) and later logs in as josé (decomposed). The lookup fails and they never log in again. Always normalize before comparing usernames or hashing passwords.
  • Search. Without normalization, a search for naïve misses documents that wrote naïve with a combining diaeresis.
  • Filename collisions. macOS HFS+ stores filenames in a near-NFD form; APFS and Linux store them in whatever bytes you pass. A folder copied between them can produce two files that look identical.
  • Hash digests. sha256("café") is one digest if the input is NFC and a different digest if it's NFD. Sign the bytes you mean to sign.
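The authentication and digest cases share one root cause: the two spellings are different byte sequences until you normalize them. A minimal sketch; canonicalUsername and utf8 are hypothetical helpers, and a real system would likely add further steps (case folding, for instance) that are out of scope here.

```javascript
// Hypothetical helper: the form used for comparison and as a storage key.
function canonicalUsername(input) {
  return input.normalize("NFC");
}

const signup = "jos\u00E9";  // precomposed é
const login = "jose\u0301";  // e + combining acute accent

const rawMatch = signup === login;                                              // false
const canonicalMatch = canonicalUsername(signup) === canonicalUsername(login);  // true

// Any hash or signature is computed over these bytes, so the form matters.
const utf8 = (s) => Array.from(new TextEncoder().encode(s));
const nfcBytes = utf8(signup.normalize("NFC")); // 5 bytes: 6A 6F 73 C3 A9
const nfdBytes = utf8(signup.normalize("NFD")); // 6 bytes: 6A 6F 73 65 CC 81

console.log(rawMatch, canonicalMatch, nfcBytes.length, nfdBytes.length);
```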

Rules of thumb

  1. Always store and transmit UTF-8. Modern protocols (HTTP, JSON, gRPC) assume it. UTF-16 is for Windows internals and Java strings; UTF-32 is rarely worth the space.
  2. Normalize on the boundary. When data arrives from the user, normalize to NFC. When it leaves, leave it alone.
  3. Don't trust .length. If you need to count "characters" the way a user does, segment by grapheme cluster: Intl.Segmenter in JS, regex with \X in PCRE, the rivo/uniseg package in Go.
  4. Pick the right boundary unit for the job. Bytes for storage and network, code units for legacy APIs, codepoints for a programmer's view, graphemes for a user's view.
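Rule 4 matters most when you cut a string. A sketch of truncating by grapheme cluster instead of by code unit; truncateGraphemes is a hypothetical helper, and it assumes Intl.Segmenter is available.

```javascript
// Hypothetical helper: keep at most `max` user-perceived characters.
function truncateGraphemes(text, max) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}

const flag = "\u{1F1EF}\u{1F1F5}";  // regional indicators J + P, rendered as one flag
const message = "hi" + flag + "!";

const byGrapheme = truncateGraphemes(message, 3); // "hi" plus the whole flag
const byCodeUnit = message.slice(0, 3);           // "hi" plus a lone high surrogate
console.log(byGrapheme.length, byCodeUnit.length); // 6 3
```

The code-unit cut leaves half of a surrogate pair at the end of the string, which typically renders as a replacement character.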

Tools in the wild

  • iconv (CLI). Convert between encodings (UTF-8 ↔ UTF-16 ↔ Latin-1) on the command line.
  • ICU (library). The reference Unicode library, used under the hood by many language runtimes and browsers.
  • Intl.Segmenter (library). Browser-native grapheme/word/sentence segmentation. The right way to count emoji.
  • unicodedata (library). Normalize, look up, and inspect Unicode codepoints from Python.