Cryptanalysis - Caesar Cipher

How do we make sure the messages we send are only read by the intended recipient? This problem has been bothering people for a long time, with examples of encryption being used to hide messages going back to ancient Egypt, some 4000 years ago.

I will be writing a series of posts on ciphers used throughout history, detailing how they work and how they can be broken. In this post I discuss the Caesar cipher. It is perhaps the simplest of the classical ciphers and was successfully used by Julius Caesar to send military messages.

The cipher is based on shifting each character in the plaintext by x many characters in the alphabet using modular arithmetic, where x is some predetermined shift value known only to the sender and receiver. The cipher uses a shared secret, which means it is a symmetric cryptosystem.

Choosing a shift value

Let's say we've chosen the message we want to send to our recipient, Alice. The next thing we now need to do is pick a shift value that will be used when encrypting the message. If we use the English alphabet, we know there are 26 possible values for each character in the message. It makes sense therefore, to pick a shift value between 1 and 25, since after the modulo operation the result would be within that range anyway for all numbers (unless you chose a shift of 26 or 0, which would result in a shift of 0 after the modulo – don't do that).

Encryption

For every character in the plaintext we need to add the shift value to its position in the alphabet to get the encrypted character. We are essentially shifting all characters in the plaintext along by x in the alphabet.

\(C_{i} = (M_{i} + shift)\quad mod\quad26\)

Plain text:  ILOVECAKE
Shift value: 3

|A |B |C |D |E |F |G |H |I |J |K |L |M |N |O |P |Q |R |S |T |U |V |W |X |Y |Z |
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|

Our plain text has the following positions in the alphabet:

I = 8
L = 11
0 = 14
V = 21
E = 4
C = 2
A = 0
K = 10
E = 4

We can now add our shift value to these positions and perform the modulo operation to get the encrypted values:

I = (8 + 3)  mod 26 = 11 = L
L = (11 + 3) mod 26 = 14 = O
O = (14 + 3) mod 26 = 17 = R
V = (21 + 3) mod 26 = 24 = Y
E = (4 + 3)  mod 26 = 7  = H
C = (2 + 3)  mod 26 = 5  = F
A = (0 + 3)  mod 26 = 3  = D
K = (10 + 3) mod 26 = 13 = N
E = (4 + 3)  mod 26 = 7  = H

Ciphertext: LORYHFDNH

Now we've encrypted our secret message we can send it along to Alice, resting assured that any adversary seeing it on its way will be unable to understand the content (actually messages encrypted using Caesar's cipher are very easy to decipher without the key, but more on that later).

Here's the Java code to encrypt:

public static String encrypt(String plaintext, int shift) {
    // only interested in the alphabet
    plaintext = plaintext.replaceAll("[^a-zA-Z]", "").toUpperCase();
    StringBuilder ciphertext = new StringBuilder();
    for (char c : plaintext.toCharArray()) {
        // all upper case chars are in the ascii range 65-90.
        // Subtracting A (65) from from the character gives us a value in the range of 0 25
        int newPos = c - 'A';
        // add the shift to the position
        newPos += shift;
        // perform the modulo to make sure the result is in the range of 0-25
        newPos = Math.floorMod(newPos, LetterFrequencyUtils.ALPHABET_COUNT);
        // add A (65) to the value to get the uppercase character
        newPos += 'A';
        ciphertext.append((char) newPos);
    }
    return ciphertext.toString();
}

Decryption

Alice receives a message from Bob that has been encrypted using the Caesar cipher. Since both Alice and Bob previously agreed that they would use a shift value of 3, Alice can easily compute the original plaintext.

The method used to decrypt the ciphertext is the opposite to the one used for encrypting the plaintext. That is, instead of adding the shift value to each character, we need to subtract it, while still performing the modulo operation to wrap round to the end of the alphabet if result value is less than 0.

\(M_{i} = (C_{i} - shift)\quad mod\quad26\)

The ciphertext characters have the following indexes in the alphabet:

L = 11
O = 14
R = 17
Y = 24
H = 7
F = 5
D = 3
N = 13
H = 7


By subtracting the shift value from each character in the cipher text we can calculate the original values:

L = (11 - 3) mod 26 = 8  = I
O = (14 - 3) mod 26 = 11 = L
R = (17 - 3) mod 26 = 14 = O
Y = (24 – 3) mod 26 = 21 = V
H = (7 – 3)  mod 26 = 4  = E
F = (5 – 3)  mod 26 = 2  = C
D = (3 – 3)  mod 26 = 0  = A
N = (13 – 3) mod 26 = 10 = K
H = (7 – 3)  mod 26 = 4  = E

Plaintext: ILOVECAKE

There we have it- Bob encrypted his message, sent it to Alice and she successfully decrypted it to reveal his secret food of choice. But how do we know that the message wasn't intercepted? With this method of encryption, we don’t!

Here's the Java code to decrypt:

public static String decrypt(String ciphertext, int shift) {
    // only interested in the alphabet
    ciphertext = ciphertext.replaceAll("[^a-zA-Z]", "").toUpperCase();
    StringBuilder plaintext = new StringBuilder();
    for (char c : ciphertext.toCharArray()) {
        // all upper case chars are in the ascii range 65-90.
        // Subtracting A (65) from from the character gives us a value in the range of 0 25
        int newPos = c - 'A';
        // subtract the shift from the position
        newPos -= shift;
        // perform the modulo to make sure the result is in the range of 0-25
        newPos = Math.floorMod(newPos, LetterFrequencyUtils.ALPHABET_COUNT);
        // add A (65) to the value to get the uppercase character
        newPos += 'A';
        plaintext.append((char) newPos);
    }
    return plaintext.toString();
}

Cryptanalysis

Mallory knows that Bob will be sending his secret food of choice to Alice encrypted using Caesar's cipher. She also happens to know that Bob is allergic to nuts, and wants to maliciously modify his message to Alice to say his food preference is in fact the deadly peanut.

In order to decipher the message, she first needs to determine the shift value used. Thankfully (for her), the English alphabet has only 26 characters, which means the key space is 26. This means the shift value could be any integer between 0 and 26, but since 0 and 26 would result in no change of plaintext it is realistic to consider the possible keys as {1, 2, … , 25}.

Brute Force

Brute force attacks are usually the simplest to implement, but have a very low computational efficiency. This, however, is not so bad in the case where our key space is so small.

Mallory needs to apply every possible shift value to each character in the ciphertext until she finds the value that successfully transforms it into the original plaintext.

The method I demonstrate here requires human intervention in that for every possible shift value the decryption is applied and the result printed. The user (Mallory) must then look at each result and decide which is the original plain text.

public static void bruteForceAttack(String ciphertext) {
    // shift of 0 or 26 would result in no change
    for (int shift = 1; shift < LetterFrequencyUtils.ALPHABET_COUNT; shift++)
        System.out.println("shift: " + shift + ", " + decrypt(ciphertext, shift));
}

Mallory executes the brute force attack, and the following is printed:

Brute Force Attack
shift: 1, KNQXGECMG
shift: 2, JMPWFDBLF
shift: 3, ILOVECAKE
shift: 4, HKNUDBZJD
shift: 5, GJMTCAYIC
shift: 6, FILSBZXHB
shift: 7, EHKRAYWGA
shift: 8, DGJQZXVFZ
shift: 9, CFIPYWUEY
shift: 10, BEHOXVTDX
shift: 11, ADGNWUSCW
shift: 12, ZCFMVTRBV
shift: 13, YBELUSQAU
shift: 14, XADKTRPZT
shift: 15, WZCJSQOYS
shift: 16, VYBIRPNXR
shift: 17, UXAHQOMWQ
shift: 18, TWZGPNLVP
shift: 19, SVYFOMKUO
shift: 20, RUXENLJTN
shift: 21, QTWDMKISM
shift: 22, PSVCLJHRL
shift: 23, ORUBKIGQK
shift: 24, NQTAJHFPJ
shift: 25, MPSZIGEOI

Now that Mallory knows the shift value used was 3, she can encrypt the message "PEANUTS" with the same shift value and send it along to Alice.

System.out.println(CaesarCipher.encrypt("PEANUTS", 3)); // prints SHDQXWV

Frequency Analysis

The brute force approach seemed to work rather well. Mallory simply needed to glance at 25 strings, each the result of shifting characters in the ciphertext down the alphabet by an incrementing shift value. She could easily see the plaintext because it would be the only string that resembled the English language. But say our alphabet wasn't 26 characters long, say we were using some alien alphabet, which was hundreds of characters long- Mallory would potentially have to look at hundreds of strings to determine which one was the plaintext, and what if she was trying to decode hundreds of these messages, all with random shift values? Ain't nobody got time for that!

By comparing the frequency of characters that we'd expect to find in the language used with those in the ciphertext, we are able to calculate a best guess at the shift value used.

A chi-squared test is a way of measuring how likely it is that any difference between two sets of values occurred by chance. This happens to be very useful for calculating the shift value. We can have one set of values containing the expected number of occurrences of characters for a string the length of the ciphertext, and another set containing the actual number of occurrences. We can then shove the values into the chi-squared formula:

\(\sum_{i = A}^{i = Z}\frac{(C_{i} - E{i})^{2}}{E_{i}}\)

Here, \(C_{i}\) is the number of times the \(i\)th letter occurred in the ciphertext, and \(E_{i}\) is the expected number of times the \(i\)th letter should occur in a string whose length is that of the ciphertext.

We can call the decrypt function and make a note of the chi-square for increasing shift values. After we've tried all shifts, the most likely key length is the one that produced the lowest chi-squared value.

public static String frequencyAnalysis(String ciphertext) {
    ciphertext = ciphertext.replaceAll("[^a-zA-Z]", "");
    int bestShift = calculateShift(ciphertext);
    return decrypt(ciphertext, bestShift);
}

public static int calculateShift(String ciphertext) {
    ciphertext = ciphertext.replaceAll("[^a-zA-Z]", "");
    int shift = 0;
    double fitness = Integer.MAX_VALUE;
    for (int i = 0; i < LetterFrequencyUtils.ALPHABET_COUNT; i++) {
        // shift the ciphertext by i characters and compute the chi-square for the result
        double tempFitness = LetterFrequencyUtils.chiSquareAgainstEnglish(CaesarCipher.decrypt(ciphertext, i));
        // if the chi-square was lower than the previous value, make a note of it
        if (tempFitness < fitness) {
            fitness = tempFitness;
            shift = i;
        }
    }
    return shift;
}

To demonstrate the frequency analysis, in the following section I encrypt Douglas Adams' The Restaurant at the End of the Universe using a shift of 3, and then try to decrypt it without providing the shift value:

System.out.println("FREQUENCY ANALYSIS");
System.out.println("Encrypting The Restaurant at the End of the Universe...");
String bigText = CaesarCipher.readFile(CaesarCipher.class.getResource("/hitch2.txt"));
String bigCipherText = CaesarCipher.encrypt(bigText, 3);
System.out.println(bigCipherText.substring(0, 200) + "...");

System.out.println("Performing frequency analysis...");
String decryptedBigCipherText = CaesarCipher.frequencyAnalysis(bigCipherText);
System.out.println(decryptedBigCipherText.substring(0, 200) + "...");

Which prints:

FREQUENCY ANALYSIS
Encrypting The Restaurant at the End of the Universe...
GRXJODVDGDPVWKHUHVWDXUDQWDWWKHHQGRIWKHXQLYHUVHGRXJODVDGDPVWKHKLWFKKLNHUVJXLGHWRWKHJDODABGRXJODVDGDPVWKHUHVWDXUDQWDWWKHHQGRIWKHXQLYHUVHGRXJODVDGDPVOLIHWKHXQLYHUVHDQGHYHUBWKLQJGRXJODVDGDPVVRORQJDQGWKDQN...
Performing frequency analysis...
DOUGLASADAMSTHERESTAURANTATTHEENDOFTHEUNIVERSEDOUGLASADAMSTHEHITCHHIKERSGUIDETOTHEGALAXYDOUGLASADAMSTHERESTAURANTATTHEENDOFTHEUNIVERSEDOUGLASADAMSLIFETHEUNIVERSEANDEVERYTHINGDOUGLASADAMSSOLONGANDTHANK...

Since this method of deciphering the message is completely automatic, it can be very useful. However, if the message is a short string, it may not show frequencies that match the language. For example, the string "HELLOWORLD" has L as the most common letter, followed by O, and when encrypted with a shift of 3, and then decrypted using the frequency analysis method it becomes "EBIILTLOIA", which is clearly wrong.

Conclusion

Using this shift cipher can be a fun way of encrypting messages if you're just passing notes to a friend, but given that the ciphertext has a non-uniform distribution of characters it can easily be decrypted by comparing the frequency of characters in the ciphertext with those expected in the language used.

I've only included snippets of code here for brevity, but you can get all of it from the GitHub repo.

In the next post I'll be discussing the Vigenère cipher!

Previous post: Number Theory Part 2

Next post: Cryptanalysis - Vigenère Cipher