2010 January 25 / j d a v i s @ c a r l e t o n . e d u

Frequency Analysis

In this lab we continue our discussion of breaking the substitution cipher by frequency analysis. There are many questions and code chunks scattered throughout the lab; make sure that you understand all of them.

Substitution Cipher

First, download substitution.py to your computer and read it to make sure you understand everything. Fortunately, the code has been cleaned up a bit since Friday.

Sorting Lists

I'm about to give you some code that involves sorting a list. To prepare for it, study this code first.

import operator
mylist = [('a', 5, "jan"), ('g', -1, "bob"), ('c', 3, "juan"), ('p', 7, "amy")]
mylist
mylist.sort()
mylist
mylist.sort(key=operator.itemgetter(0))
mylist
mylist.sort(key=operator.itemgetter(1))
mylist
mylist.sort(key=operator.itemgetter(2))
mylist

Letter Frequency Analysis

Frequency analysis of ciphertext boils down to knowing which letters (or words, or phrases, or...) are most commonly used in the (English) language. Python doesn't know such facts by default; you have to tell it. For this we need a sample text. Download thewaroftheworlds.txt to your computer. I don't suggest that you read it right now; it's big. Here's how you make Python load the whole text file into a string.

file = open("thewaroftheworlds.txt")
text = file.read()

How many characters are in the file? How many words are in the file?

Now read the following function. How exactly does it work? What exactly does it produce? I have stripped all of the comments out of it, so that you have to think harder.

def frequencyList(text):
    """Given a string of text, returns a frequency list describing the occurrence of the 26 English letters (uppercase and lowercase) in that text."""
    freqs = [0] * 26
    for i in range(len(text)):
        index = substitution.indexForLetter(text[i])
        if index != None:
            freqs[index] += 1
    total = float(sum(freqs))
    for i in range(26):
        freqs[i] = (substitution.letterForIndex(i), freqs[i] / total)
    freqs.sort(key=operator.itemgetter(1))
    return freqs

Based on thewaroftheworlds.txt, what are the most frequently used letters in the language, and what are the least frequently used letters?

Here's another function for you to understand. This time I've left in the comments, to make things easier.

def suspectedSubstitution(c, langFreqList):
    """Given a ciphertext and a frequency list for the relevant language (English), returns a substitution list that is suspected to have produced the ciphertext."""
    # Analyze character frequencies in the encrypted message.
    freqList = frequencyList(c)
    # Pair characters according to their frequencies.
    pairs = []
    for i in range(26):
        pairs.append((freqList[i][0], langFreqList[i][0]))
    pairs.sort(key=operator.itemgetter(1))
    # Build the suspected substitution list.
    subst = []
    for i in range(26):
        subst.append(pairs[i][0])
    return subst

Use this function to decrypt a ciphertext, such as the encrypted "ITS A FACT THE EGREGIOUS TAHITIAN FELLOW WAS SHEARED BY A SHINTO TEEN NAMED CLINT PMUDDRRRSHHOOOONNTEE", the encrypted text in substitution.py, or an encrypted text of your own creation. How well does it work? Answer: Not very well at all. Why?

Word Frequency Analysis

Here's an observation: There are only two common one-letter words in English. Intuitively, how could we exploit this knowledge in our frequency analysis of substitution ciphertext?

There are many more two-letter words than one-letter words. It would be difficult to list them all reliably from memory. Fortunately, there is list of words already on your computer, in the file at /usr/share/dict/words:

file = open("/usr/share/dict/words")
text = file.read()
print text

Write some Python code that uses this file to produce a list of all of the two-letter words in English. Intuitively, how could we exploit this knowledge in improving our frequency analysis of substitution ciphertext?

Dictionaries

This last part is about the substitution cipher, but not about frequency analysis of it. Your assigned reading for today included the notion of Python dictionaries. A dictionary is kind of like a list, in that it's an object whose entire purpose is to store other objects. But, whereas the objects in a list mylist are always indexed from 0 up to len(mylist) - 1, the objects in a dictionary can be indexed by almost any kind of Python object whatsoever.

Complete the following code to produce a function that takes in a dictionary and returns a new copy of the dictionary.

def dictionaryCopy(d):
    new = {}
    keys = d.keys()
    for i in range(len(keys)):
        # !! fill in this line
    return new

Now write a similar function, called dictionaryInverse(), that takes in a dictionary and returns an inverse dictionary (which is just like the original dictionary, with the roles of keys and values reversed).

I claim that substitution.py would be easier to write and easier to improve upon if we used substitution dictionaries rather than substitution lists. Explain.