Cryptanalysis - Vigenère Cipher

04 May 2015

In this post I cover the Vigenère cipher. It's an encryption system that builds on the Caesar cipher covered in the previous post, with the additional twist that it uses multiple shift values instead of one.

Caesar's cipher shifts every character in the plaintext by a single shift value, resulting in ciphertext with a non-uniform distribution of characters. This means we can look at the frequencies of characters in the ciphertext and easily work out the shift value. The Vigenère cipher, however, uses several different shift values, which means the ciphertext has a much more uniform distribution of characters, making it harder to decrypt using the same frequency analysis method.

Encryption

A series of characters are used as the key (instead of just 1 value in \(\{0…|alphabet|-1\}\) for Caesar's cipher). Each character in the key represents a shift value.

Plaintext: THETRUTHISRARELYPUREANDNEVERSIMPLE
Key:       WILDE

The key is then repeated so its length matches that of the plaintext.

plaintext: THETRUTHISRARELYPUREANDNEVERSIMPLE
key:       WILDEWILDEWILDEWILDEWILDEWILDEWILD

For every character in the key, its index in the alphabet is used as the shift value for the corresponding plaintext character. This means we have n Caesar ciphers at work, where n is the number of characters in the key.

\(C_{i} = (M_{i} + K_{i})\quad mod\quad26\)

|A |B |C |D |E |F |G |H |I |J |K |L |M |N |O |P |Q |R |S |T |U |V |W |X |Y |Z |
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|

Plaintext indices:
T = 19
H = 7
E = 5
T = 19
R = 17
U = 20
T = 19
H = 7
…

Key indices:
W = 22
I = 8
L = 11
D = 3
E = 4

T = (19 + 22) mod 26 = 15 = P
H = (7 + 8)   mod 26 = 15 = P
E = (4 + 11)  mod 26 = 15 = P
T = (19 + 3)  mod 26 = 22 = W
R = (17 + 4)  mod 26 = 21 = V
U = (20 + 22) mod 26 = 16 = Q
T = (19 + 8)  mod 26 =  1 = B
H = (7 + 11)  mod 26 = 18 = S
…

Ciphertext: PPPWVQBSLWNICHPUXFUIWVOQIRMCVMIXWH

Here's the Java code to encrypt:

public static String encrypt(String plaintext, String key) {
    // key must be an alphabetic string
    if (!key.matches("[a-zA-Z]+"))
        throw new IllegalArgumentException("Invalid key - must be one or more characters in range a...z");

    // only interested in the alphabet
    plaintext = plaintext.replaceAll("[^a-zA-Z]", "").toUpperCase();

    // make key same length as plaintext
    key = StringUtils.repeatString(key, plaintext.length()).toUpperCase();

    StringBuilder ciphertext = new StringBuilder();
    for (int i = 0; i < plaintext.length(); i++) {
        // get the character at index i
        String toEncrypt = String.valueOf(plaintext.charAt(i));
        // the shift is equal to the char at index i of the key
        // subtracting A (65) to have a value in range 0-25
        int shift = key.charAt(i) - 'A';
        ciphertext.append(CaesarCipher.encrypt(toEncrypt, shift));
    }
    return ciphertext.toString();
}

Decryption

Just like encryption, we repeat the key until it's the length of the ciphertext:

ciphertext: PPPWVQBSLWNICHPUXFUIWVOQIRMCVMIXWH
key:        WILDEWILDEWILDEWILDEWILDEWILDEWILD

We can then call the Caesar cipher's decrypt function for every ciphertext character and corresponding key character. This will shift each character in the ciphertext backwards by the value of the key character.

\(M_{i} = (C_{i} - K_{i})\quad mod\quad26\)

Ciphertext indices:
P = 15
P = 15
P = 15
W = 22
V = 21
Q = 16
B = 1
S = 18
L = 11
…

Key indices:
W = 22
I = 8
L = 11
D = 3
E = 4

P = (15 - 22) mod 26 = 19 = T
P = (15 - 8)  mod 26 =  7 = H
P = (15 - 11) mod 26 =  4 = E
W = (22 - 3)  mod 26 = 19 = T
V = (21 - 4)  mod 26 = 17 = R
Q = (16 - 22) mod 26 = 20 = U
B = (1  - 8)  mod 26 = 19 = T
S = (18 - 11) mod 26 =  7 = H
…

Plaintext: THETRUTHISRARELYPUREANDNEVERSIMPLE

Here's the Java code to decrypt:

public static String decrypt(String ciphertext, String key) {
    // key must be an alphabetic string
    if (key == null || !key.matches("[a-zA-Z]+"))
        throw new IllegalArgumentException("Invalid key - must be one or more characters in range a...z");

    // only interested in the alphabet
    ciphertext = ciphertext.replaceAll("[^a-zA-Z]", "").toUpperCase();

    // make key same length as plaintext
    key = StringUtils.repeatString(key, ciphertext.length()).toUpperCase();

    StringBuilder plaintext = new StringBuilder();
    for (int i = 0; i < ciphertext.length(); i++) {
        // get the character at index i
        String toDecrypt = String.valueOf(ciphertext.charAt(i));
        // the shift is equal to the char at index i of the key
        // subtracting 'A' (65) to have a value in range 0-25
        int shift = key.charAt(i) - 'A';
        plaintext.append(CaesarCipher.decrypt(toDecrypt, shift));
    }
    return plaintext.toString();
}

Cryptanalysis

When we don't have the key, deciphering text encrypted using the Vigenère cipher is slightly more involved than the methods we used for the Caesar cipher. Firstly, instead of just one shift value, there are many. In fact, there are potentially as many shift values as the length of the message (e.g. plaintext: "hello", key: "world"). This means that two characters with the same value in the plaintext can be encrypted to characters with different values in the ciphertext, which gives a much more uniform distribution of characters. It also means the frequency analysis approach we used for the Caesar cipher won't work here.

Brute Force

Using a brute force attack on the Caesar cipher wasn't that terrible really. The maximum number of shifts to try before finding the plaintext was the length of the alphabet – 1 (\(|A|-1\)), which in the case of the English alphabet is just 25. However, with the Vigenère cipher we don't know how long the key is. This means we need to try keys from length 1 to the length of the ciphertext. For every key length, we need to create keys of every combination of characters and then try decrypting the text with it.

An algorithm for a brute force attack might look something like:

FOR keyLen = 1 to cipherLen STEP 1 DO
    FOR combIndex = 0 to (|alphabet|^keyLen)-1 STEP 1 DO
        key = getPermutation(alphabet, keyLen, combIndex);
        PRINT VigenèreCipher.decrypt(ciphertext, key);
    OD
OD

Which is clearly awful. You'd be looping 26ⁿ, for n = 1 to the length of the ciphertext. If the ciphertext was of length 5, this would mean 26¹ + 26² + 26³ + 26⁴ + 26⁵ iterations. There must be a quicker way…

Index of Coincidence

The Ic (index of coincidence) of some string \(s\) or \(I_{c}(s)\) is the probability that any two randomly chosen characters in \(s\) are identical. Here's the formula (practicalcryptography.com have more to say on it):

\(\sum_{i = A}^{i = Z}\frac{(F_{i})(F_{i}-1)}{N(N-1)}\)

Where \(F_{i}\) is the total number of times the \(i\)th character occurred in the ciphertext, and \(N\) is the length of the ciphertext.

As discussed in the previous post, we know that text in the English language has a non-uniform distribution of characters, i.e. the letter E occurs about 12.7% of the time, followed by T (9.1%), then A (8.2%), etc. Knowing this we were able to decrypt text encrypted using the Caesar cipher because those frequencies didn't change when the plaintext was encrypted, only the characters occurring at those frequencies did. We shifted the alphabet until the frequencies of characters matched the expected frequencies for characters in the English language.

It turns out that English text usually has an Ic value of around 0.067. If we had calculated the Ic for text encrypted using the Caesar cipher, the value would have been exactly the same for the plaintext. This is because the formula only takes into account frequencies of characters, it doesn't know/care what character occurs at a given frequency. This is useful to know when trying to calculate the key length, because we are looking for a value such that each Caesar cipher in the key produces text with an Ic "close" to the English language Ic.

Index of Coincidence- calculating the key length

Let us loop through every possible key length (1 .. length of cipher text), and calculate the Ic for strings made of up of characters encrypted using the same key character. The key length that produces an average Ic close to the expected Ic for English text (0.067) is our best guess at the key length.


CIPHERTEXT:

HAKQUMKNMJENPCACRMGDUCCDYAMGRLQFMJENPFTUHBQN
TDLXGNWQFMJEPGSMGRGUPBTAECRFQFMJEZCLTZYEKELC
SFCLEWNKGGTTDXFYXNLHYSNPOKDIMKNZVHBUAMCDBUTT
PCXQFKQUZJLRPIGGTRVWHOIENIHPMBNELKSTPUMVEKNY
BPSBINBHIVCNMNIMVLXDLNGGKGEGRLTPEMYHHUETREWG
SVGNWGDEKFXHOKOSTTELQAFCZBPGEAPKKMBVIOGTACTM
JERUTBNLMJIGMDBIIMCLPCTVJELCRXCPKGTMANXCTBFET

Key length 2:
ciphertext: H|A|K|Q|U|M|K|N|M|J|E|N|P|C|A|C|R|M|G|D|U|C|C|D|Y|A|M|...
key pos:    1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|...

encrypted with 1st character of key:
  HKUKMEPARGUCYM...
  Index of coincidence: 0.0423963133640553

encrypted with 2nd character of key:
  AQMNJNCCMDCDA...
  Index of coincidence: 0.0432051608522197

Average index of coincidence: 0.04280073710813749
Close to English: FALSE


Key length 3:
ciphertext: H|A|K|Q|U|M|K|N|M|J|E|N|P|C|A|C|R|M|G|D|U|C|C|D|Y|A|M|...
key pos:    1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|...

encrypted with 1st character of key:
  HQKJPCGCY...
  Index of coincidence: 0.07024557395773844

encrypted with 2nd character of key:
  AUNECRDCA...
  Index of coincidence: 0.06396344945745289

encrypted with 3rd character of key:
  KMMNAMUDM...
  Index of coincidence: 0.07614696363982486

Average index of coincidence: 0.07011866235167206
Close to English: TRUE

Best guess at key length = 3

There we have it, we determined the key length to be 3. In reality, this approach might not be the most effective. For example, we stopped when we reached an Ic value "close" to the Ic for English text, but it could be the case that there are other key lengths that we didn't check that produce closer values.

Calculating the key from its length

Now that we know the key length, we need to solve each caesar cipher in the key. For example, if our Vigenère cipher had a key length of 4, we would have 4 Caesar ciphers solve.

For every Caesar cipher we need to extract each character from the ciphertext that was encrypted using it, like so:

Key length 3:
ciphertext: H|A|K|Q|U|M|K|N|M|J|E|N|P|C|A|C|R|M|G|D|U|C|C|D|Y|A|M|...
key pos:    1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|...

encrypted with 1st Caesar cipher:
  HQKJPCGCY...

encrypted with 2nd Caesar cipher:
  AUNECRDCA...

encrypted with 3rd Caesar cipher:
  KMMNAMUDM...

We can then apply the Caesar cipher frequency analysis attack on each Caesar cipher's encrypted text. This should give us the most likely shift value used for the corresponding Caesar cipher.

Once we know the shift value of each Caesar cipher, we will also know the key used in the encryption of the Vigenère cipher:

CaesarCipher.calculateShift("HQKJPCGCY..."); // returns 2
CaesarCipher.calculateShift("AUNECRDCA..."); // returns 0
CaesarCipher.calculateShift("KMMNAMUDM..."); // returns 19

Note: 'A' is the starting ASCII value of uppercase characters, so 5 + 'A' = 5 + 65 = 70 = ASCII character F

key   = 2, 0, 19
  = 2 + 'A',  0 + 'A', 19 + 'A'
  = 2 + 65, 0 + 65, 19 + 65
  = 67, 65, 84
  = C, A, T
  = CAT

All that's left to do now is to pop the key into the Vigenère decryption function to see if it successfully decrypts the ciphertext:

VigenereCipher.decrypt("HAKQUMKNMJENPCACRMGDUCCDYAMGRLQFMJENPFTUHBQN...", "CAT");
returns:
FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHION
ABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXYLIESA
SMALLUNREGARDEDYELLOWSUNORBITINGTHISATADISTA
NCEOFROUGHLYNINETYTWOMILLIONMILESISANUTTERLY
INSIGNIFICANTLITTLEBLUEGREENPLANETWHOSEAPEDE
SCENDEDLIFEFORMSARESOAMAZINGLYPRIMITIVETHATT
HEYSTILLTHINKDIGITALWATCHESAREAPRETTYNEATIDEA

Sweet, an excerpt from The Hitchhiker's Guide to the Galaxy!

Conclusion

At first glance the Vigenère cipher appears to be a much harder cipher to crack than the Caesar cipher; it has a more uniform distribution of characters, so a basic frequency analysis won't do. However, since this cipher is made up of multiple Caesar ciphers, once we know the key length we can simply call our Caesar cipher attack code on each Caesar cipher to determine the key.

The index of coincidence attack becomes much more effective when the ciphertext is a large string because we can then get an accurate indication of whether each Caesar cipher's ciphertext has an Ic close to the Ic for English text. When the ciphertext is a short string we cannot accurately determine this, and therefore we cannot automatically determine the Vigenère cipher key. In this case it would be a good idea to alter the algorithm to return the 3-5 most likely key lengths.

I've implemented the Vigenère cipher encryption/decryption and attacks in Java- you can grab the code here.