datatypes · level 5

Strings & Encodings

UTF-8, UTF-16, surrogates, graphemes — why string length is a trick question.

Asking "how long is this string?" has at least four different answers, and picking the wrong one ships real bugs: truncated usernames, mis-billed SMS, broken regex matches, the .length === 8 family emoji.

Analogy

Think of asking "how long is that train?" and getting four totally different answers depending on who you ask. The conductor counts carriages (grapheme clusters — what a human sees). The freight yard counts axles (code points). The rail engineer counts couplings (code units in memory). The weighbridge reports kilograms of steel (bytes on disk). All four answers are about the same train; none of them agrees with the others. Telling the passenger "this train is 47 long" is useless unless you say which unit you're using.

Unicode: the platonic string

Unicode is a table of code points: abstract characters, each with a number. A is U+0041. é is U+00E9. 漢 is U+6F22. 😀 is U+1F600. There are over 150,000 assigned characters, in a code space of 1,114,112 code points (U+0000 through U+10FFFF).

Code points are the semantic answer to "how many characters?" — but programs don't store code points directly, they store an encoding.

UTF-8

The dominant encoding on the wire and on disk. Variable-width, 1 to 4 bytes per code point:

Range               Bytes   Comment
U+0000–U+007F       1       ASCII, byte-for-byte compatible
U+0080–U+07FF       2       Accented Latin, Greek, Cyrillic, Hebrew, Arabic
U+0800–U+FFFF       3       Most CJK, Thai, Devanagari, rest of the BMP
U+10000–U+10FFFF    4       Most emoji, rare CJK, historic scripts

UTF-8 is self-synchronising: you can start decoding from any byte and find the next code-point boundary quickly. It's the only sensible default for files, HTTP bodies, databases.
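To make the byte layout and the self-synchronising property concrete, here is a minimal sketch. Both `encodeCodePointUtf8` and `nextBoundary` are hypothetical helpers written for this lesson, not built-ins; in real code you would normally let `TextEncoder` do the encoding.

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes by hand.
function encodeCodePointUtf8(cp) {
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not an encodable code point: " + cp);
  }
  if (cp <= 0x7f) return [cp];                                     // 0xxxxxxx
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];  // 110xxxxx 10xxxxxx
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical helper: from any byte offset, step forward to the next
// code-point boundary. Continuation bytes always look like 10xxxxxx.
function nextBoundary(bytes, i) {
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) i++;
  return i;
}

// "aé漢😀" written with escapes: 1 + 2 + 3 + 4 = 10 bytes
const bytes = new TextEncoder().encode("a\u00e9\u6f22\u{1F600}");

encodeCodePointUtf8(0x1f600); // [0xF0, 0x9F, 0x98, 0x80]
nextBoundary(bytes, 4);       // 6: offsets 4 and 5 are continuation bytes of 漢
```

Because a lead byte never looks like a continuation byte, a decoder that lands mid-character skips at most three bytes before it is back in sync.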

UTF-16

The native string representation in JavaScript, Java, C#, Windows APIs. 2 bytes for BMP (Basic Multilingual Plane = U+0000–U+FFFF), 4 bytes (a surrogate pair) for anything above.

A surrogate pair is two 16-bit code units: a high surrogate (0xD800–0xDBFF) followed by a low surrogate (0xDC00–0xDFFF). Together they encode one code point.
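The arithmetic behind a surrogate pair can be sketched as follows. `toSurrogatePair` and `fromSurrogatePair` are hypothetical helper names used only for illustration; JavaScript does this work for you inside `String.fromCodePoint` and `codePointAt`.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("code point is not above the BMP");
  }
  const offset = cp - 0x10000;            // a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: the inverse operation.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [hi, lo] = toSurrogatePair(0x1f600); // [0xD83D, 0xDE00]
"😀".charCodeAt(0) === hi;                 // true
"😀".charCodeAt(1) === lo;                 // true
fromSurrogatePair(hi, lo) === 0x1f600;     // true
```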

This is why "😀".length === 2 in JavaScript: 😀 is U+1F600, outside the BMP, so it's stored as two UTF-16 code units, and .length returns code units, not code points.

"hello".length;               // 5
"😀".length;                   // 2 ← surrogate pair
[..."😀"].length;              // 1 ← iterator walks code points
"😀".codePointAt(0);           // 0x1F600

UTF-32

Fixed-width: 4 bytes for every code point. Simple to index but wasteful — a page of English text takes 4× the space of UTF-8. Rarely used in practice; you'll see it as an intermediate representation inside Unicode processing libraries.
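A rough way to compare the three encodings for a given string is sketched below. `encodedSizes` is a hypothetical helper; JavaScript has no built-in UTF-32 encoder, so the UTF-32 size is computed as four bytes per code point. The sketch assumes the input contains no lone surrogates, since `TextEncoder` would replace those with U+FFFD.

```javascript
// Hypothetical helper: size in bytes of a well-formed string in each encoding.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // variable width, 1 to 4 bytes per code point
    utf16: str.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: [...str].length * 4,                 // 4 bytes per code point
  };
}

encodedSizes("hello");        // { utf8: 5, utf16: 10, utf32: 20 }
encodedSizes("\u6f22\u5b57"); // 漢字: { utf8: 6, utf16: 4, utf32: 8 }
```

The second call shows that UTF-8 is not always the smallest: BMP text outside the 1- and 2-byte ranges takes 3 bytes per character in UTF-8 but only 2 in UTF-16.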

Grapheme clusters

A grapheme is what a human perceives as a single character. Code points don't quite match:

  • é can be stored as one code point (U+00E9) or two (U+0065 e + U+0301 combining acute). Both render as é.
  • Flag emoji are two "regional indicator" code points: 🇬🇧 = 🇬 + 🇧.
  • Family emoji are ZWJ sequences: 👨‍👩‍👧 = 👨 + ZWJ + 👩 + ZWJ + 👧 (5 code points, 1 grapheme, and in UTF-16, 8 code units).

Measured in JavaScript, the family emoji gives a different answer for each unit:

"👨‍👩‍👧".length;                           // 8 — UTF-16 code units
[..."👨‍👩‍👧"].length;                       // 5 — code points
new TextEncoder().encode("👨‍👩‍👧").length; // 18 — UTF-8 bytes
// Grapheme count requires Intl.Segmenter:
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("👨‍👩‍👧")].length;          // 1
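The two spellings of é mentioned above are different strings as far as `===` is concerned, even though both are one grapheme. The sketch below shows one way to compare them, using Unicode normalisation (`String.prototype.normalize`), which this lesson has not introduced. `sameText` and `countGraphemes` are hypothetical helper names.

```javascript
const composed = "\u00e9";    // é as a single code point
const decomposed = "e\u0301"; // e followed by U+0301 combining acute accent

// Hypothetical helper: compare two strings after normalising both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical helper: count user-perceived characters.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

composed === decomposed;              // false: different code point sequences
sameText(composed, decomposed);       // true
[...decomposed].length;               // 2 code points
countGraphemes(decomposed);           // 1
countGraphemes("\u{1F1EC}\u{1F1E7}"); // 1: the GB flag is two regional indicators
```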

Picking the right counter for each context:

  • User-facing character limits (display names, post length) → grapheme count
  • Database column widths (VARCHAR(10)) → depends on the database engine and character set; may be characters, UTF-16 code units, or bytes
  • Byte budgets (e.g. the 255-byte filename limit on most Linux filesystems) → UTF-8 bytes
  • Slicing a string safely → code points at minimum (graphemes if you also want to keep combining marks and emoji sequences intact)
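As an illustration of the last item, here is a minimal sketch of slicing by graphemes instead of code units. `truncateGraphemes` is a hypothetical helper, and the `"en"` locale default is an assumption.

```javascript
// Hypothetical helper: keep at most `max` user-perceived characters.
function truncateGraphemes(str, max, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === max) break;
    out += segment; // a segment is a whole grapheme cluster, never half of one
    count++;
  }
  return out;
}

const name = "ab\u{1F600}cd"; // "ab😀cd"

name.slice(0, 3);           // "ab" plus a lone high surrogate: broken text
truncateGraphemes(name, 3); // "ab😀"
```

Creating the segmenter inside the function keeps the sketch self-contained; code that truncates many strings would likely reuse a single `Intl.Segmenter` instance.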

The rules

  1. Default to UTF-8 on the wire. Always declare charset=utf-8 on HTTP text responses.
  2. Use UTF-8 for source files, config files, log lines, database storage.
  3. When you need "how many characters the user sees", use graphemes (Intl.Segmenter in JS, the third-party grapheme package in Python).
  4. When you need "how much space will this take in the database / on disk", use UTF-8 bytes.
  5. Don't index into strings with raw integers unless you know the code points are all BMP. In JS, "😀"[0] gives you half a surrogate pair: a lone surrogate that is not a valid character on its own.
  6. Never trust a bare length. It's UTF-16 code units in JS (.length) and Java (length()), code points in Python (len()), and UTF-8 bytes in Go (len()) and Rust (.len()).
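Rules 3 and 4 often meet in one place: a field with a byte budget that must not cut a visible character in half. The sketch below shows one possible approach. `truncateToUtf8Bytes` is a hypothetical helper, and it assumes the storage layer counts UTF-8 bytes.

```javascript
// Hypothetical helper: longest prefix of `str` that fits in `maxBytes` of UTF-8
// without splitting a grapheme cluster.
function truncateToUtf8Bytes(str, maxBytes, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  let out = "";
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    const size = encoder.encode(segment).length; // bytes this grapheme needs
    if (used + size > maxBytes) break;           // stop before overflowing the budget
    out += segment;
    used += size;
  }
  return out;
}

truncateToUtf8Bytes("h\u00e9llo", 3);   // "hé": 1 + 2 bytes
truncateToUtf8Bytes("a\u{1F600}b", 4);  // "a": the 4-byte emoji does not fit in the remaining 3 bytes
```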

Unicode is a sprawling standard, but the rule you can always fall back on is: be explicit about what you're counting.