Frequency Analysis: How to Break Substitution Ciphers
Learn how frequency analysis breaks substitution ciphers by exploiting letter frequency patterns in English. Includes step-by-step worked example.
Frequency analysis is the technique that breaks most substitution ciphers. It works because English, like every natural language, has predictable letter frequency patterns: E appears about 12.7% of the time, T about 9.1%, and Z barely 0.07%. When a substitution cipher replaces each letter with a fixed substitute, those frequency patterns survive encryption. In a reasonably long ciphertext, the most common letter is very likely to stand for E.
This simple insight has been breaking ciphers for over a thousand years, and it remains the first tool any cryptanalyst reaches for when facing an unknown monoalphabetic substitution.
A Brief History
The Arab polymath Al-Kindi (Abu Yusuf Yaqub ibn Ishaq al-Kindi) described frequency analysis in the 9th century in his manuscript On Deciphering Cryptographic Messages. This makes it one of the oldest known techniques in cryptanalysis — and it was so effective that it eventually rendered simple substitution ciphers obsolete for military use.
Al-Kindi's method was straightforward: count the letters in the ciphertext, compare the counts to known letter frequencies in the target language, and make substitutions. The technique spread through the Islamic world and eventually reached Europe, where it drove the development of more complex ciphers like the Vigenere.
English Letter Frequencies
Here are the standard letter frequencies for English text, based on large corpus analysis:
| Letter | Frequency | Letter | Frequency | Letter | Frequency |
|--------|-----------|--------|-----------|--------|-----------|
| E | 12.70% | D | 4.25% | G | 2.02% |
| T | 9.06% | L | 4.03% | B | 1.49% |
| A | 8.17% | C | 2.78% | V | 0.98% |
| O | 7.51% | U | 2.76% | K | 0.77% |
| I | 6.97% | M | 2.41% | X | 0.15% |
| N | 6.75% | W | 2.36% | J | 0.15% |
| S | 6.33% | F | 2.23% | Q | 0.10% |
| H | 6.09% | Y | 1.97% | Z | 0.07% |
| R | 5.99% | P | 1.93% | | |
The mnemonic ETAOIN SHRDLU (the 12 most common letters in rough order) dates back to Linotype typesetting machines and is still used by codebreakers today.
How to Apply Frequency Analysis Step-by-Step
Step 1: Count Every Letter
Go through the ciphertext and tally every occurrence of every letter. The more text you have, the better this works: with fewer than about 50 letters, single-letter counts are only a rough guide and you will lean more heavily on word patterns.
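The tally is easy to automate. Here is a minimal Python sketch using the standard library's `collections.Counter`. The function name `count_letters` and the sample string are made up for illustration, and the sketch assumes the cipher only uses the letters A-Z (everything else is ignored).

```python
from collections import Counter

def count_letters(ciphertext: str) -> Counter:
    """Tally each letter A-Z, ignoring case, spaces, and punctuation."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    return Counter(letters)

# Hypothetical sample input, chosen only to show the call.
counts = count_letters("Uifsf jt b tfdsfu!")
print(counts.most_common(3))  # the three most frequent letters with their counts
```

`Counter.most_common()` returns the letters already sorted by count, which is what the next step needs.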
Step 2: Rank by Frequency
Sort the ciphertext letters from most to least frequent. The most common letter is your strongest candidate for E.
Step 3: Make Your First Guesses
Map the most frequent ciphertext letters to the most frequent English letters:
- Most common ciphertext letter → likely E
- Second most common → likely T or A
- Third most common → likely A, O, or T
These are guesses, not certainties. Context will confirm or reject them.
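Ranking and first guesses can be sketched together in Python. The sketch below pairs ciphertext letters with English letters purely by rank. The names `ENGLISH_ORDER` and `first_guess` are my own, the ordering comes from the frequency table above (X and J are tied, so their order is arbitrary), and the result is only a starting hypothesis that you should expect to correct by hand.

```python
from collections import Counter

# English letters from most to least frequent, following the table above.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKXJQZ"

def first_guess(ciphertext: str) -> dict[str, str]:
    """Pair ciphertext letters with English letters by frequency rank."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    # Sort by count, highest first; break ties alphabetically so results are repeatable.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return dict(zip(ranked, ENGLISH_ORDER))

print(first_guess("XXXYYZ"))  # {'X': 'E', 'Y': 'T', 'Z': 'A'}
```

On real ciphertexts only the top few pairings tend to be right, which is why the later steps rely on patterns rather than on rank alone.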
Step 4: Look for Common Patterns
Beyond single letters, look for frequent pairs (bigrams) and triples (trigrams):
Most common bigrams: TH, HE, IN, EN, AN, ER, RE, ON, NT, ES
Most common trigrams: THE, AND, ING, HER, HAT, HIS, THA, ERE, FOR
If a three-letter group appears frequently in the ciphertext, try mapping it to THE first. If a two-letter group is very common, try TH or HE.
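Counting bigrams and trigrams works the same way as counting single letters. The following Python sketch assumes the ciphertext keeps its word spaces, so it counts letter groups inside each word and never across a space. The function name `ngram_counts` is a placeholder of my own.

```python
from collections import Counter

def ngram_counts(ciphertext: str, n: int) -> Counter:
    """Count n-letter sequences inside each word (never across spaces)."""
    counts = Counter()
    for word in ciphertext.upper().split():
        letters = "".join(ch for ch in word if "A" <= ch <= "Z")
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts

# Usage: the most frequent trigram is the first candidate for THE.
print(ngram_counts("QBK ZQBKR QBK", 3).most_common(1))  # [('QBK', 3)]
```

If your ciphertext has had its spaces removed, you could treat the whole text as one word instead; the counts are then noisier because they include groups that straddle word boundaries.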
Step 5: Trial and Error
Fill in your guesses and read the partially decoded text. Some words will become recognizable. Each confirmed word reveals more letter mappings, which reveal more words, creating a cascade effect.
For example, suppose you've determined which three ciphertext letters map to T, H, and E, and the partial decode shows the five-letter pattern _THE_. Very few common English words fit that pattern, and OTHER is by far the most likely. So try O for the first unknown letter and R for the last, then check whether those two new mappings produce sensible words elsewhere in the text.
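Reading a partial decode is easier if the computer fills in what you know and blanks out the rest. This is a small Python sketch of that idea; `apply_partial_key` and the three-letter key in the usage line are invented for illustration and do not come from any particular cipher.

```python
def apply_partial_key(ciphertext: str, key: dict[str, str]) -> str:
    """Decode known letters and show unknown letters as underscores."""
    out = []
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            out.append(key.get(ch, "_"))  # "_" marks a letter not yet solved
        else:
            out.append(ch)  # keep spaces and punctuation as they are
    return "".join(out)

# Hypothetical key: Q=T, B=H, K=E are the only letters guessed so far.
print(apply_partial_key("QBK ZQBKR", {"Q": "T", "B": "H", "K": "E"}))  # THE _THE_
```

Each time you confirm a word, add its letters to the key and run the function again to see what else falls into place.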
Worked Example
Here's a ciphertext to crack:
YMJ VZNHP GWTBS KTC OZRUX TAJW YMJ QFED ITL
Step 1: Count frequencies.
| Letter | Count | Letter | Count |
|--------|-------|--------|-------|
| T | 4 | J | 3 |
| Y | 2 | M | 2 |
| Z | 2 | W | 2 |

The other 20 letters each appear once. With only 35 letters, the counts alone are not decisive: the most frequent letter, T, will turn out not to stand for E.
Step 2: Spot patterns. "YMJ" appears twice, which suggests a common three-letter word, most likely THE. That gives us Y=T, M=H, J=E.
Step 3: Substitute and look for more.
THE _____ _____ ___ _____ __E_ THE ____ ___
Three letters don't reveal much of the text yet, but the mappings themselves share a pattern: Y→T, M→H, and J→E each move exactly 5 places back in the alphabet. A substitution that shifts every letter by the same amount is a Caesar cipher.
Step 4: Try a Caesar shift. If every plaintext letter was shifted forward by 5, shifting each ciphertext letter back by 5 should produce readable text. Checking the second word: V→Q, Z→U, N→I, H→C, P→K gives QUICK, so the shift holds.
Decoding with a Caesar shift of 5:
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
This example happened to be a Caesar cipher, which frequency analysis handles easily. For a random substitution (where each letter maps independently), the process takes longer but follows the same logic.
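The consistency check behind Step 4 can be sketched in Python: if every letter pair in a guessed word implies the same shift, the cipher is probably a Caesar cipher. The helper names `shift_text` and `infer_shift` are my own, and the sketch builds its own ciphertext from the pangram rather than relying on any text being typed in correctly.

```python
from typing import Optional

def shift_text(text: str, shift: int) -> str:
    """Shift every letter A-Z forward by `shift` places (negative = backward)."""
    out = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def infer_shift(cipher_word: str, guessed_word: str) -> Optional[int]:
    """Return the one forward shift that turns guessed_word into cipher_word, else None."""
    if len(cipher_word) != len(guessed_word):
        return None
    pairs = zip(cipher_word.upper(), guessed_word.upper())
    shifts = {(ord(c) - ord(p)) % 26 for c, p in pairs}
    return shifts.pop() if len(shifts) == 1 else None

plaintext = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
ciphertext = shift_text(plaintext, 5)       # encrypt with a shift of 5
shift = infer_shift(ciphertext[:3], "THE")  # guess that the first word is THE
print(shift, shift_text(ciphertext, -shift))
```

When `infer_shift` returns `None`, the guessed letters do not share a single shift, so either the guess is wrong or the cipher is a general substitution rather than a Caesar cipher.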
The Index of Coincidence
The Index of Coincidence (IC) measures how likely two randomly selected letters from a text are to be the same. It helps determine whether a cipher is monoalphabetic (one alphabet) or polyalphabetic (multiple alphabets, like Vigenere).
- English text IC: approximately 0.0667
- Random text IC: approximately 0.0385
If the ciphertext's IC is close to 0.067, it's likely a monoalphabetic substitution, and frequency analysis will work directly. If the IC is closer to 0.038, the cipher is probably polyalphabetic, and you'll need to first determine the key length (for example, with the Kasiski examination) before applying frequency analysis to each key position separately.
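For a text with N letters in which letter i appears n_i times, the usual formula is IC = Σ n_i(n_i − 1) / (N(N − 1)). A minimal Python sketch of that formula follows; the function name `index_of_coincidence` is my own choice, and the sketch assumes only the letters A-Z count.

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement are identical."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

print(index_of_coincidence("AABB"))  # 4 / 12, about 0.333
```

A monoalphabetic substitution only renames letters, so it leaves the IC unchanged. That is why a ciphertext value near 0.067 points to a single alphabet. On short texts the IC is noisy, so treat the thresholds as guides rather than hard cutoffs.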
When Frequency Analysis Fails
Frequency analysis is powerful against monoalphabetic substitution but fails against several stronger cipher types:
Polyalphabetic ciphers like Vigenere use multiple substitution alphabets, flattening the frequency distribution. Frequency analysis only works after you've separated the ciphertext into groups by key position.
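Separating the ciphertext by key position is a simple slicing operation once you have a candidate key length. The Python sketch below assumes that only letters advance the key (spaces and punctuation are skipped), which is a common convention but not a universal one. The function name `split_by_key_position` is hypothetical.

```python
def split_by_key_position(ciphertext: str, key_length: int) -> list[str]:
    """Group letters that were encrypted with the same key letter."""
    if key_length < 1:
        raise ValueError("key_length must be at least 1")
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    # Letters 0, k, 2k, ... share key position 0; letters 1, k+1, ... share position 1; etc.
    return ["".join(letters[i::key_length]) for i in range(key_length)]

print(split_by_key_position("ABCDEFG", 3))  # ['ADG', 'BE', 'CF']
```

Each group is effectively its own Caesar-shifted text, so you can run ordinary frequency analysis on each group in turn.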
Transposition ciphers rearrange letters without changing them, so the ciphertext has exactly the same letter counts as the plaintext. For an English message that means the frequencies look like normal English, but the letters are in the wrong order. Frequency analysis can suggest that you're looking at a transposition, but it can't solve it directly. Try a Rail Fence or columnar transposition decoder instead.
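You can see this property with a toy transposition in Python. The function below is a deliberately simplified, keyless column read-off that I am using as a stand-in for a real columnar transposition; the name `read_off_columns` is made up.

```python
from collections import Counter

def read_off_columns(text: str, n_columns: int) -> str:
    """Write the text in rows of n_columns letters, then read it column by column."""
    return "".join(text[i::n_columns] for i in range(n_columns))

plain = "ATTACKATDAWN"
scrambled = read_off_columns(plain, 3)
print(scrambled)                               # AAAATCTWTKDN
print(Counter(scrambled) == Counter(plain))    # True: same letters, new order
```

Because the counts are identical, a frequency table that looks like ordinary English next to unreadable text is a hint that the letters were moved rather than replaced.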
Homophonic substitution maps common letters to multiple ciphertext symbols (E might be represented by four different symbols), deliberately flattening frequencies. The Zodiac Killer's Z408 used this technique.
Short texts don't contain enough data for reliable frequency statistics. Below about 25 characters, the natural variation in letter usage overwhelms the expected frequency pattern.
Try Our Cipher Solver
Our substitution cipher solver applies frequency analysis automatically — paste in your ciphertext and it maps the most likely letter substitutions, letting you refine the solution interactively. For identifying unknown cipher types, our cipher identifier analyzes character frequencies, patterns, and structure to suggest the most probable cipher.
Frequently Asked Questions
How much ciphertext do I need for frequency analysis to work?
As a rule of thumb, 25-50 characters give rough results, 100+ characters give reliable results, and 200+ characters make the analysis almost automatic. Shorter texts require more reliance on pattern recognition (common words, bigrams) than on pure frequency counting.
Does frequency analysis work for languages other than English?
Yes — every language has characteristic letter frequencies. French favors E, S, A, I, N; German favors E, N, I, S, R; Spanish favors E, A, O, S, R. You just need the correct frequency table for the target language.
Can a computer do frequency analysis automatically?
Absolutely. Automated solvers can try millions of substitution mappings per second, using frequency matching, bigram analysis, and dictionary lookups to converge on the solution. Our cipher tools do exactly this.
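One common way to do the frequency matching is a chi-squared score that compares a candidate decryption's letter counts with the counts you would expect from English. The Python sketch below is one possible scoring function, not a description of any particular solver. The names `ENGLISH_FREQ` and `chi_squared` are my own, and the percentages are the figures from the frequency table earlier in this article.

```python
from collections import Counter

# English letter frequencies in percent, copied from the table earlier in this article.
ENGLISH_FREQ = {
    "E": 12.70, "T": 9.06, "A": 8.17, "O": 7.51, "I": 6.97, "N": 6.75,
    "S": 6.33, "H": 6.09, "R": 5.99, "D": 4.25, "L": 4.03, "C": 2.78,
    "U": 2.76, "M": 2.41, "W": 2.36, "F": 2.23, "G": 2.02, "Y": 1.97,
    "P": 1.93, "B": 1.49, "V": 0.98, "K": 0.77, "X": 0.15, "J": 0.15,
    "Q": 0.10, "Z": 0.07,
}

def chi_squared(text: str) -> float:
    """Lower scores mean the letter counts look more like English."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    total = len(letters)
    score = 0.0
    for letter, percent in ENGLISH_FREQ.items():
        expected = total * percent / 100
        score += (counts[letter] - expected) ** 2 / expected
    return score

# A candidate full of rare letters scores far worse than one full of common letters.
print(chi_squared("EEEE") < chi_squared("ZZZZ"))  # True
```

A solver can score each candidate key's output this way and keep the lowest-scoring ones, then use bigram statistics or a dictionary to choose among them.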
What's the difference between frequency analysis and brute force?
Brute force tries every possible key (all 25 Caesar shifts, or all 26! substitution alphabets). Frequency analysis uses statistical patterns to narrow down the correct key intelligently. For a Caesar cipher, both are fast. For a general substitution cipher with 26! possible keys (about 4 × 10²⁶), brute force is computationally infeasible, but frequency analysis remains practical.
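To put the two key spaces side by side, here is a short Python calculation. The attacker speed of one billion keys per second is a hypothetical figure chosen only for illustration.

```python
import math

caesar_keys = 25                        # non-trivial Caesar shifts
substitution_keys = math.factorial(26)  # every possible ordering of the alphabet

# Hypothetical attacker speed, chosen only for illustration.
KEYS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = substitution_keys / KEYS_PER_SECOND / SECONDS_PER_YEAR
print(f"Caesar: {caesar_keys} keys")
print(f"Substitution: {substitution_keys:.2e} keys, about {years:.1e} years to try them all")
```

Under that assumption, exhausting the substitution key space would take on the order of ten billion years, while all 25 Caesar shifts can be checked by hand in a few minutes.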