
Frequency Analysis: How to Break Substitution Ciphers

Learn how frequency analysis breaks substitution ciphers by exploiting letter frequency patterns in English. Includes step-by-step worked example.

April 20, 2026 | 8 min read | By Stephen

Frequency analysis is the technique that breaks most substitution ciphers. It works because English (like every natural language) has predictable letter frequencies: E appears about 12.7% of the time, T about 9.1%, and Z barely 0.07%. When a substitution cipher replaces each letter with a fixed substitute, those frequency patterns survive encryption. In a reasonably long ciphertext, the most common letter most likely stands for E.

This simple insight has been breaking ciphers for over a thousand years, and it remains the first tool any cryptanalyst reaches for when facing an unknown monoalphabetic substitution.

A Brief History

The Arab polymath Al-Kindi (Abu Yusuf Yaqub ibn Ishaq al-Kindi) described frequency analysis in the 9th century in his manuscript On Deciphering Cryptographic Messages. This makes it one of the oldest known techniques in cryptanalysis — and it was so effective that it eventually rendered simple substitution ciphers obsolete for military use.

Al-Kindi's method was straightforward: count the letters in the ciphertext, compare the counts to known letter frequencies in the target language, and make substitutions. The technique spread through the Islamic world and eventually reached Europe, where it drove the development of more complex ciphers like the Vigenere.

English Letter Frequencies

Here are the standard letter frequencies for English text, based on large corpus analysis:

Letter	Frequency	Letter	Frequency	Letter	Frequency
E	12.70%	D	4.25%	P	1.93%
T	9.06%	L	4.03%	B	1.49%
A	8.17%	C	2.78%	V	0.98%
O	7.51%	U	2.76%	K	0.77%
I	6.97%	M	2.41%	J	0.15%
N	6.75%	W	2.36%	X	0.15%
S	6.33%	F	2.23%	Q	0.10%
H	6.09%	G	2.02%	Z	0.07%
R	5.99%	Y	1.97%

The mnemonic ETAOIN SHRDLU (the 12 most common letters in rough order) dates back to Linotype typesetting machines and is still used by codebreakers today.

How to Apply Frequency Analysis Step-by-Step

Step 1: Count Every Letter

Go through the ciphertext and tally every occurrence of every letter. With fewer than 50 characters, frequency analysis becomes unreliable — the more text you have, the better this works.
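The tally can be automated in a few lines. This is a minimal Python sketch, assuming the ciphertext uses only the 26 Latin letters; the function name `letter_counts` and the sample string are illustrative, not taken from any particular tool.

```python
from collections import Counter
from string import ascii_uppercase

def letter_counts(ciphertext: str) -> Counter:
    """Tally each letter A-Z, ignoring case, spaces, digits, and punctuation."""
    return Counter(ch for ch in ciphertext.upper() if ch in ascii_uppercase)

# Hypothetical sample ciphertext, used only to show the output format.
counts = letter_counts("Wkh txlfn eurzq ira")
total = sum(counts.values())
for letter, n in counts.most_common(3):
    print(f"{letter}: {n} ({n / total:.1%})")
```

Printing percentages next to the raw counts makes it easier to compare against a reference frequency table.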

Step 2: Rank by Frequency

Sort the ciphertext letters from most to least frequent. The most common letter is your strongest candidate for E.

Step 3: Make Your First Guesses

Map the most frequent ciphertext letters to the most frequent English letters:

Most common ciphertext letter → likely E
Second most common → likely T or A
Third most common → likely A, O, or T

These are guesses, not certainties. Context will confirm or reject them.
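One way to turn Steps 2 and 3 into a starting point is to pair the ranked ciphertext letters with the English ranking. The sketch below assumes a one-to-one substitution over A-Z; `first_guess_key` and `apply_key` are hypothetical helper names, and the resulting key is only a first draft to be corrected by hand.

```python
from collections import Counter
from string import ascii_uppercase

# English letters from most to least frequent, sorted by the frequency table above.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def first_guess_key(ciphertext: str) -> dict[str, str]:
    """Pair the i-th most common ciphertext letter with the i-th most common English letter."""
    letters = [ch for ch in ciphertext.upper() if ch in ascii_uppercase]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_key(ciphertext: str, key: dict[str, str]) -> str:
    """Decode mapped letters; show unmapped letters as underscores."""
    return "".join(
        key.get(ch, "_") if ch in ascii_uppercase else ch
        for ch in ciphertext.upper()
    )
```

On short texts, most of this draft key will be wrong beyond the first few letters, which is why the pattern checks in the next steps matter.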

Step 4: Look for Common Patterns

Beyond single letters, look for frequent pairs (bigrams) and triples (trigrams):

Most common bigrams: TH, HE, IN, EN, AN, ER, RE, ON, NT, ES

Most common trigrams: THE, AND, ING, HER, HAT, HIS, THA, ERE, FOR

If a three-letter group appears frequently in the ciphertext, try mapping it to THE first. If a two-letter group is very common, try TH or HE.
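Counting bigrams and trigrams works the same way as counting single letters. The following sketch assumes word spaces are preserved in the ciphertext and counts n-grams only inside words; `ngram_counts` is an illustrative name.

```python
from collections import Counter
from string import ascii_uppercase

def ngram_counts(ciphertext: str, n: int) -> Counter:
    """Count n-letter sequences inside words (no n-gram spans a space)."""
    counts = Counter()
    for word in ciphertext.upper().split():
        letters = "".join(ch for ch in word if ch in ascii_uppercase)
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts
```

Calling `ngram_counts(text, 3).most_common(5)` lists the five most frequent trigrams, which are the first candidates to test against THE, AND, and ING.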

Step 5: Trial and Error

Fill in your guesses and read the partially decoded text. Some words will become recognizable. Each confirmed word reveals more letter mappings, which reveal more words, creating a cascade effect.

For example, if you've determined that three ciphertext letters map to T, H, and E, and you see the pattern _THE_ in the partial decode, the first letter might be O (OTHER), A (no common word fits ATHE_), or I (ITHER isn't a word), so O is the strongest guess.
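The same elimination can be done against a word list. This is a rough sketch: `WORDS` is a tiny made-up list standing in for a real dictionary, and `candidates` is a hypothetical helper. It assumes a one-to-one key, so a blank can never be a plaintext letter that is already assigned.

```python
import re
from string import ascii_uppercase

# Tiny illustrative word list; a real solver would load a full dictionary.
WORDS = ["OTHER", "THERE", "THESE", "ETHER", "WATER", "ABOUT"]

def candidates(pattern: str, known_plain: str, words=WORDS) -> list[str]:
    """Return words matching a partial decode such as '_THE_'.

    Each '_' is an undeciphered letter, so it may be any letter
    except the plaintext letters already assigned (known_plain).
    """
    allowed = "".join(c for c in ascii_uppercase if c not in known_plain)
    unknown = f"[{allowed}]"
    regex = re.compile(
        "".join(unknown if c == "_" else re.escape(c) for c in pattern)
    )
    return [w for w in words if regex.fullmatch(w)]

print(candidates("_THE_", known_plain="THE"))  # prints ['OTHER']
```

ETHER is rejected even though it fits the shape, because E is already accounted for by another ciphertext letter.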

Worked Example

Here's a ciphertext to crack:

YMJ VZNHP GWTBS KTC OZRUX TAJW YMJ QFED ITL

Step 1: Count frequencies.

Letter	Count	Letter	Count
T	4	J	3
Y	2	M	2
Z	2	W	2
V	1	N	1
H	1	P	1
G	1	B	1
S	1	K	1
C	1	O	1
R	1	U	1
X	1	A	1
Q	1	F	1
E	1	D	1
I	1	L	1

With only 35 letters, the counts are nearly flat, so the ranking alone is a weak guide. Patterns carry more weight here.

Step 2: Spot patterns. "YMJ" appears twice. A repeated three-letter word is likely THE, which gives Y=T, M=H, J=E.

Step 3: Substitute and look for more.

THE _____ _____ ___ _____ __E_ THE ____ ___

Three letters are not enough to read any other word, but the guesses themselves contain a clue: Y is 5 places after T in the alphabet, M is 5 after H, and J is 5 after E.

Step 4: Try a Caesar shift. A consistent offset suggests that every plaintext letter was shifted forward by 5. Shifting each ciphertext letter back by 5 and checking the second word: V→Q, Z→U, N→I, H→C, P→K, which spells QUICK.

Decoding the whole message with a shift of 5:

THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG

This example happened to be a Caesar cipher, which frequency analysis handles easily. For a random substitution (where each letter maps independently), the process takes longer but follows the same logic.
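The shift check from the worked example can be sketched in Python. The snippet encrypts the pangram itself with a shift of 5 and then recovers the shift from the guessed word THE; `caesar` and `crib_shift` are illustrative names, and the approach assumes the guessed word really is the plaintext of the first ciphertext word.

```python
from string import ascii_uppercase as ALPHABET
from typing import Optional

def caesar(text: str, shift: int) -> str:
    """Shift every letter forward by `shift` positions (negative shifts decode)."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )

def crib_shift(cipher_word: str, plain_word: str) -> Optional[int]:
    """Return the single shift that maps plain_word to cipher_word, or None."""
    shifts = {
        (ALPHABET.index(c) - ALPHABET.index(p)) % 26
        for c, p in zip(cipher_word.upper(), plain_word.upper())
    }
    return shifts.pop() if len(shifts) == 1 else None

plaintext = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
ciphertext = caesar(plaintext, 5)

shift = crib_shift(ciphertext.split()[0], "THE")
if shift is not None:
    print(shift, caesar(ciphertext, -shift))
```

If the letter pairs disagree on the offset, `crib_shift` returns None, which signals either a wrong guess or a cipher that is not a plain Caesar shift.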

The Index of Coincidence

The Index of Coincidence (IC) measures how likely two randomly selected letters from a text are to be the same. It helps determine whether a cipher is monoalphabetic (one alphabet) or polyalphabetic (multiple alphabets, like Vigenere).

English text IC: approximately 0.0667
Random text IC: approximately 0.0385

If the ciphertext's IC is close to 0.067, it's likely a monoalphabetic substitution, and frequency analysis will work directly. If the IC is closer to 0.038, the cipher is probably polyalphabetic, and you'll need to first determine the key length (for example, with the Kasiski examination) before applying frequency analysis to each key position separately.
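A minimal sketch of the IC calculation, using the standard formula over letter counts (the sum of n_i(n_i - 1) divided by N(N - 1), where n_i is the count of each letter and N is the total). The function name is illustrative, and the sketch assumes text over A-Z only.

```python
from collections import Counter
from string import ascii_uppercase

def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement are equal."""
    letters = [ch for ch in text.upper() if ch in ascii_uppercase]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

Because a monoalphabetic substitution only relabels letters, it leaves the set of counts, and therefore the IC, unchanged. On short ciphertexts the value is noisy, so treat it as a hint rather than a verdict.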

When Frequency Analysis Fails

Frequency analysis is powerful against monoalphabetic substitution but fails against several stronger cipher types:

Polyalphabetic ciphers like Vigenere use multiple substitution alphabets, flattening the frequency distribution. Frequency analysis only works after you've separated the ciphertext into groups by key position.

Transposition ciphers rearrange letters without changing them, so the ciphertext has exactly the same letter frequencies as the plaintext, but the letters are in the wrong order. A frequency count that looks like ordinary English while the text stays unreadable points to a transposition, but frequency analysis can't solve it directly. Try a Rail Fence or columnar transposition decoder instead.

Homophonic substitution maps common letters to multiple ciphertext symbols (E might be represented by four different symbols), deliberately flattening frequencies. The Zodiac Killer's Z408 used this technique.

Short texts don't contain enough data for reliable frequency statistics. Below about 25 characters, the natural variation in letter usage overwhelms the expected frequency pattern.
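For the polyalphabetic case above, separating the ciphertext by key position is simple once a key length is known or guessed. The sketch below assumes that only letters advance the key, a common convention but not a universal one; `split_by_key_position` is a hypothetical helper name.

```python
from string import ascii_uppercase

def split_by_key_position(ciphertext: str, key_length: int) -> list[str]:
    """Group letters by which key letter encrypted them.

    Assumes spaces and punctuation were skipped during encryption,
    so only letters A-Z count toward the key position.
    """
    letters = [ch for ch in ciphertext.upper() if ch in ascii_uppercase]
    return ["".join(letters[i::key_length]) for i in range(key_length)]
```

Each returned group was encrypted with a single alphabet, so it can be analyzed like a monoalphabetic ciphertext, provided it is long enough.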

Try Our Cipher Solver

Our substitution cipher solver applies frequency analysis automatically — paste in your ciphertext and it maps the most likely letter substitutions, letting you refine the solution interactively. For identifying unknown cipher types, our cipher identifier analyzes character frequencies, patterns, and structure to suggest the most probable cipher.

Frequently Asked Questions

How much ciphertext do I need for frequency analysis to work?

Generally, 25-50 characters gives rough results; 100+ characters gives reliable results; 200+ characters makes the analysis almost automatic. Shorter texts require more reliance on pattern recognition (common words, bigrams) than pure frequency counting.

Does frequency analysis work for languages other than English?

Yes — every language has characteristic letter frequencies. French favors E, S, A, I, N; German favors E, N, I, S, R; Spanish favors E, A, O, S, R. You just need the correct frequency table for the target language.

Can a computer do frequency analysis automatically?

Absolutely. Automated solvers can try millions of substitution mappings per second, using frequency matching, bigram analysis, and dictionary lookups to converge on the solution. Our cipher tools do exactly this.

What's the difference between frequency analysis and brute force?

Brute force tries every possible key (all 25 Caesar shifts, or all 26! substitution alphabets). Frequency analysis uses statistical patterns to narrow down the correct key intelligently. For a Caesar cipher, both are fast. For a general substitution cipher with 26! possible keys (about 4 × 10²⁶), brute force is impossible but frequency analysis remains practical.
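The size gap is easy to check numerically. In this sketch, the attacker speed of one billion keys per second is an arbitrary assumption chosen only for illustration.

```python
import math

caesar_keys = 25                        # non-trivial Caesar shifts
substitution_keys = math.factorial(26)  # all orderings of the alphabet

# Hypothetical attacker speed, chosen only for illustration.
keys_per_second = 1_000_000_000
seconds_per_year = 365 * 24 * 3600
years = substitution_keys / keys_per_second / seconds_per_year

print(f"Caesar: {caesar_keys} keys")
print(f"Substitution: {substitution_keys:.2e} keys, {years:.1e} years to try them all")
```

Even at that rate, exhausting the substitution key space would take on the order of ten billion years, while the 25 Caesar shifts can be checked by hand.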

Written by

Stephen

Stephen has 5 years of experience in cybersecurity and software testing for fraud detection and compliance. His background in security systems and cryptographic principles drives SolveCipher's commitment to accuracy and practical cipher education.