Frequency Analysis

The technique that broke every classical cipher.

Every classical substitution cipher before the polyalphabetic family was broken by the same technique: count the letters. Letters appear with very different frequencies in natural language; a simple substitution cipher preserves those frequencies; the attacker reads them off the ciphertext and matches them to the expected plaintext distribution.

The technique, in one paragraph

In English text:

  • E appears about 12.7% of the time.
  • T appears about 9.1%.
  • A, O, I, N, S, H, R each appear roughly 6–8%.
  • Z, Q, X, J each appear less than 0.2%.

A simple substitution cipher replaces each plaintext letter with a fixed ciphertext letter. If plaintext E maps to ciphertext F, then F makes up about 12.7% of the ciphertext. The frequency distribution is preserved; only the labels are shuffled.
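To see the "labels are shuffled" point concretely, here is a minimal sketch. The random key, the seed, and the sample plaintext are all made up for illustration; the only claim is that the sorted list of counts is the same before and after encryption.

```python
from collections import Counter
import random
import string

def substitute(text, key):
    """Apply a monoalphabetic substitution; non-letters pass through unchanged."""
    return "".join(key.get(ch, ch) for ch in text.upper())

def letter_counts(text):
    """Count only the letters A-Z."""
    return Counter(ch for ch in text if "A" <= ch <= "Z")

# Hypothetical key: a shuffled alphabet (seeded so the demo is repeatable).
rng = random.Random(0)
shuffled = list(string.ascii_uppercase)
rng.shuffle(shuffled)
key = dict(zip(string.ascii_uppercase, shuffled))

plaintext = "MEET ME AT THE SAME PLACE AS LAST TIME"
ciphertext = substitute(plaintext, key)

# The counts themselves survive encryption; only the letters attached to them change.
plain_profile = sorted(letter_counts(plaintext).values(), reverse=True)
cipher_profile = sorted(letter_counts(ciphertext).values(), reverse=True)
print(plain_profile)
print(cipher_profile)
```

Both printed lists are identical, which is exactly the weakness the attacker exploits.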

So the attacker:

  1. Counts each letter in the ciphertext.
  2. Sorts them by frequency.
  3. Maps the most common ciphertext letter to E, second-most to T, third to A, and so on.
  4. Reads the partially-decoded text and tweaks the mapping until it makes sense.

For a long enough ciphertext, this works almost mechanically.
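Steps 1 to 3 can be sketched in a few lines. This is a rough first pass, not a solver: `ENGLISH_ORDER` is an assumed ranking of all 26 letters (it places C just ahead of U, which the ETAOIN SHRDLU mnemonic leaves out), the function names are invented for this example, and step 4 (tweaking by hand) is still up to you.

```python
from collections import Counter

# Assumed English ranking, most to least common. Rankings differ slightly between corpora.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def initial_guess(ciphertext):
    """Steps 1-3: count, sort by frequency, pair ranks with ENGLISH_ORDER."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_guess(ciphertext, mapping):
    """Decode with the current mapping; spaces and punctuation pass through."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext.upper())

# Short demo ciphertext: far too short for the first guess to be accurate.
demo = "WKH TXLFN EURZQ IRA MXPSV RYHU WKH ODCB GRJ"
guess = initial_guess(demo)
print(apply_guess(demo, guess))
```

On a text this short the output is mostly wrong, which is the point of step 4. With a few hundred letters the top handful of mappings tend to land, and the rest can be fixed from context.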

ETAOIN SHRDLU

The classic mnemonic for (approximately) the 12 most common English letters in descending frequency:

Letter        E     T    A    O    I    N    S    H    R    D    L    U
% of letters  12.7  9.1  8.2  7.5  7.0  6.7  6.3  6.1  6.0  4.3  4.0  2.8

Memorise this and you can crack a substitution cipher of a few hundred characters by hand in 15-20 minutes.

Bigrams and trigrams

Single-letter frequency is the entry point. Bigram and trigram statistics are sharper signals:

Top bigrams    Top trigrams    Top double letters
TH (3.9%)      THE (3.5%)      LL (0.6%)
HE (3.7%)      AND (1.6%)      EE (0.4%)
IN (2.3%)      ING (1.1%)      SS (0.4%)
ER (2.0%)      ION (0.9%)      OO (0.4%)
AN (1.6%)      TIO (0.8%)      TT (0.4%)
RE (1.4%)      ENT (0.7%)      FF (0.3%)

If you've identified that ciphertext F is plaintext E, and you see the ciphertext bigram XF appearing 3% of the time, X is almost certainly H — making XF the encryption of HE.
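The XF reasoning above can be sketched as a bigram count. The function names are invented for this example, and the sketch strips spaces before counting, so bigrams that straddle a word boundary are included; that is a simplification you may not want on ciphertexts that keep word breaks.

```python
from collections import Counter

def bigram_counts(text):
    """Count overlapping letter pairs, ignoring anything that is not A-Z."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    return Counter(a + b for a, b in zip(letters, letters[1:]))

def likely_partners(text, known_cipher_letter, top=3):
    """Ciphertext letters that most often come right before a letter you already know."""
    preceding = Counter()
    for bigram, n in bigram_counts(text).items():
        if bigram[1] == known_cipher_letter:
            preceding[bigram[0]] += n
    return preceding.most_common(top)

# If F is believed to be plaintext E, the most common letter before F is a candidate for H.
print(likely_partners("XF QXF XFB GXF", "F"))
```

The same function works for any letter you have already pinned down; each confirmed mapping makes the next bigram check sharper.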

Index of coincidence

A useful summary statistic, defined for a text of length N:

IoC = Σ (n_i × (n_i − 1)) / (N × (N − 1))

where n_i is the count of the i-th letter of the alphabet and the sum runs over all 26 letters. Intuitively: the probability that two letters picked at random from the text (without replacement) are the same.
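The formula translates almost directly into code. A minimal sketch, assuming the text is reduced to the letters A-Z before counting (the function name is ours):

```python
from collections import Counter

def index_of_coincidence(text):
    """IoC = sum(n_i * (n_i - 1)) / (N * (N - 1)) over the letters A-Z."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters to compute the IoC")
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

print(index_of_coincidence("AABB"))  # 4 / 12
print(index_of_coincidence("ABCD"))  # no repeated letters
```

Very short texts give noisy values, so treat the result as meaningful only once you have a few hundred letters.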

Distribution                               IoC
English plaintext                          ~0.067
Random text (uniform)                      ~0.038
Substitution cipher of English             ~0.067 (preserved)
Polyalphabetic cipher (Vigenère et al.)    ~0.038–0.045 (depending on key length)
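For a long text the IoC approaches the sum of squared letter probabilities, which is where the two reference values come from. The sketch below uses a commonly quoted English frequency table; the exact percentages are an assumption and vary by corpus, which is why the result lands near, rather than exactly on, ~0.067.

```python
# Commonly quoted English letter frequencies in percent (assumed; corpus-dependent).
ENGLISH_PERCENT = {
    "E": 12.70, "T": 9.06, "A": 8.17, "O": 7.51, "I": 6.97, "N": 6.75,
    "S": 6.33, "H": 6.09, "R": 5.99, "D": 4.25, "L": 4.03, "C": 2.78,
    "U": 2.76, "M": 2.41, "W": 2.36, "F": 2.23, "G": 2.02, "Y": 1.97,
    "P": 1.93, "B": 1.49, "V": 0.98, "K": 0.77, "J": 0.15, "X": 0.15,
    "Q": 0.095, "Z": 0.074,
}

# Normalise so the probabilities sum to exactly 1 despite rounding in the table.
total = sum(ENGLISH_PERCENT.values())
probabilities = [p / total for p in ENGLISH_PERCENT.values()]

# Long-text limit of the IoC: probability that two independent draws match.
english_ioc = sum(p * p for p in probabilities)
uniform_ioc = 26 * (1 / 26) ** 2  # every letter equally likely: 1/26

print(f"English: {english_ioc:.3f}   uniform: {uniform_ioc:.3f}")
```

A substitution cipher only permutes the probabilities, so the sum of squares (and therefore the IoC) does not change.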

The IoC tells you whether you're dealing with a single-alphabet substitution (high IoC, attackable by simple frequency) or a polyalphabetic cipher (low IoC, requires Kasiski/Friedman period detection first).
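One simple way to do the period detection is an IoC-based test in the spirit of Friedman: split the ciphertext into columns for each candidate period and average the IoC of the columns. This is a sketch with invented function names, not the full Kasiski or Friedman procedure. At the true period each column is a single-alphabet substitution, so the average jumps back up towards the English value.

```python
from collections import Counter

def ioc(letters):
    """Index of coincidence of a sequence of letters (0.0 if too short)."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def average_column_ioc(ciphertext, period):
    """Split into `period` interleaved columns and average their IoCs."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    columns = [letters[i::period] for i in range(period)]
    return sum(ioc(col) for col in columns) / period

def guess_period(ciphertext, max_period=10):
    """Return (best period, all scores); ties go to the smallest period."""
    scores = {p: average_column_ioc(ciphertext, p) for p in range(1, max_period + 1)}
    return max(scores, key=scores.get), scores

# Artificial demo: the single letter E repeated under key "ABC" gives EFGEFG...
best, scores = guess_period("EFG" * 20)
print(best)
```

Multiples of the true period score just as well, so on real ciphertext look for the smallest period where the average clearly rises rather than trusting the maximum blindly.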

Cribs

A crib is a known or guessed plaintext word. You guess that a ciphertext fragment is "THE" or "AND" or "MEMO" or "SECRET", and that gives you 3-5 letter mappings instantly. Repeat with another guess; cross-check; iterate.
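For a simple substitution cipher, a crib can only fit where the ciphertext has the same pattern of repeated letters. The sketch below (function names are ours, and it assumes the ciphertext has already been stripped to letters) finds those offsets and turns a chosen one into letter mappings.

```python
def pattern(word):
    """Repetition pattern of a word, e.g. 'SECRET' -> (0, 1, 2, 3, 1, 4)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)

def crib_positions(ciphertext, crib):
    """Offsets where the crib's repetition pattern matches the ciphertext."""
    target = pattern(crib)
    n = len(crib)
    return [i for i in range(len(ciphertext) - n + 1)
            if pattern(ciphertext[i:i + n]) == target]

def mapping_from_crib(ciphertext, crib, offset):
    """Ciphertext-to-plaintext mappings implied by placing the crib at `offset`."""
    return dict(zip(ciphertext[offset:offset + len(crib)], crib))

ciphertext = "ZZVHFUHWZZ"  # hypothetical fragment
for offset in crib_positions(ciphertext, "SECRET"):
    print(offset, mapping_from_crib(ciphertext, "SECRET", offset))
```

Short cribs with no repeated letters (THE, AND) match almost everywhere, so pattern matching helps most with longer words; for short ones, lean on the frequency and bigram evidence to choose the position.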

Bletchley Park's crib-driven attacks against Enigma were industrial-scale extensions of this same technique. Codebreakers knew that German weather reports routinely contained "WETTER" and that some operators reliably typed "HEILXHITLER"; those known-plaintext fragments were the starting point for recovering the daily key.

When frequency analysis breaks down

Three classical cipher families resist single-letter frequency analysis:

  1. Polyalphabetic (Vigenère, Beaufort, Enigma): multiple substitution alphabets in rotation, so the overall letter frequencies flatten out. Periodic ciphers such as Vigenère are attacked by Kasiski examination to find the period, then frequency analysis on each alphabet separately.
  2. Transposition (rail fence, columnar): letters are reordered, not substituted. The letter-frequency distribution is identical to the plaintext's, so there is no mapping for a single-letter attack to recover. Bigram and trigram statistics still help as a diagnostic: normal single-letter frequencies combined with far too few THs and HEs point to transposition.
  3. One-time pad: truly random key, as long as the message, used once. Per-position output is uniform. No statistical signal whatsoever.

That third item is why OTP is the only theoretically unbreakable classical cipher. Everything else leaks.
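The transposition case is easy to check for yourself. The sketch below uses a deliberately simplified columnar transposition (columns read left to right, no keyword) and a made-up plaintext; the point is only that the letter counts are untouched while adjacent-letter statistics change.

```python
from collections import Counter

def columnar_transposition(text, width):
    """Write the letters in rows of `width`, then read the columns left to right."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    return "".join("".join(letters[col::width]) for col in range(width))

def count_bigram(text, bigram):
    """Number of (overlapping) occurrences of a two-letter sequence."""
    return sum(1 for i in range(len(text) - 1) if text[i:i + 2] == bigram)

plain = "THE WEATHER THERE IS THE SAME AS THE OTHER DAY"
plain_letters = "".join(ch for ch in plain if ch.isalpha())
cipher = columnar_transposition(plain, 5)

# Same multiset of letters, so the single-letter histogram is identical.
print(Counter(cipher) == Counter(plain_letters))
# Adjacent pairs are broken up, so common bigrams such as TH become rare.
print(count_bigram(plain_letters, "TH"), count_bigram(cipher, "TH"))
```

An English-looking histogram with an English-looking IoC but abnormal bigram counts is the cue to stop trying letter mappings and start trying rearrangements.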

Al-Kindi (the OG)

Frequency analysis has a long history. The first systematic written description we have is from Abu Yusuf Yaqub ibn Ishaq al-Kindi, a 9th-century Iraqi polymath. His treatise "A Manuscript on Deciphering Cryptographic Messages" (~850 CE) lays out the method in detail (letter counting, language statistics, the use of context) about 600 years before European cryptanalysts independently rediscovered it.

When you read about Renaissance court spies "breaking ciphers," they were rediscovering Al-Kindi's work.

What this lesson asks of you

The playground hands you ciphertexts of varying length and lets you map letters interactively, watching the partial plaintext emerge. The visualizer shows the frequency histogram of the ciphertext alongside the expected English distribution; lining up the two histograms visually is a much more intuitive way to spot likely mappings than staring at percentages.

Tools in the wild

  • CrypTool (free tier, service)

    Browser and desktop frequency-analysis tools, plus a full classical-cipher cryptanalysis suite.

  • dCode (free tier, service)

    Online classical-cipher analyser. Frequency tables, n-gram tools, autosolvers.

  • quipqiup (free tier, service)

    Auto-solves substitution ciphers via frequency analysis + dictionary attack.

  • Python utilities for n-gram frequency, IoC, classical cipher cryptanalysis. (library)