2009 May 12 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u

Search With Context

Carleton College CS 201, Prof. Joshua R. Davis

When you use a web search engine to search for a keyword, you often get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:

Marmoset - Wikipedia, the free encyclopedia
Apr 23, 2009 ... Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the ...
en.wikipedia.org/wiki/Marmoset - 48k - Cached - Similar pages
Paul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages

Here's how this works. Google is continually crawling the web. For each web page it finds, it records all of the interesting words on the page as well as where they occur on the page. Then, when you search for a word, Google finds all of the pages with that word, ranks them by importance/relevance, and for each one shows the search word in context. The context is essentially some fixed number of words or characters surrounding the search word. The context helps you decide which pages will really be useful to you.

To understand all of this requires several skills. The crawling part is related to graph searches, which we're about to study in our course. The importance/relevance part uses linear algebra and probability theory; we'll leave those to math courses. In this assignment, we're focused on the problem of recording interesting words and displaying them in context, for a single web page.

Specifically, you will write a program that fetches a web page, builds a dictionary of search terms for that web page, searches the page for a particular search term, and displays the search result in context. For example, when my version of the program searches our Color Histograms assignment for the word "histogram", it prints this output:

...e program itself) to generate a histogram that shows the spread of colors.... On the right end of the red histogram there is a large spike, which i...omatically, so that the tallest histogram uses the whole vertical space b...

I have the program set to display 32 characters around the search word. Notice that I insert ellipses ("...") between the various occurrences of the search word, to indicate omitted text. Similarly, when I search for "points" I get this output.

...riteria: Enumeration works. (3 points) The histograms work. (6 points) Subclasses are well-written. M...s are not accessed directly. (3 points) All of the code is good Python..., organized, and documented. (3 points) ...

Notice that, because the word "points" occurs repeatedly in quick succession, the 32-character contexts around the occurrences overlap, and my program handles this gracefully. Your output doesn't have to look like mine, but you should take some care to make it look decent.

You may complete this assignment with one partner. It is due Sunday 11:59 PM.

0. Prepare

These are some files you might need.

dictionarybst.py is our implementation of a dictionary using a binary search tree. You won't actually use this code, but you may find it helpful in writing your own dictionary class.
hashtable.py is our auto-resizing, chaining HashTable class. You'll write a dictionary based on it.
search.py is the search-with-context program that you'll be editing and handing in.

Take a moment right now to read through search.py. The comments should explain what's going on, even if the code itself is incomprehensible to you (because it relies heavily on regular expressions). In fact, run the program right now. It prints out a plain text version of the web page. It also prints a list of pairs of indices, which you can decipher as follows.

The first pair of indices is (2, 5). It indicates that the first word in the text occurs between characters 2 and 5. More precisely, it indicates that the first word is text[2:5], which is JRD. Verify for yourself that this is the first word in the text. Also go to the web page and verify that this is its first word. (It's in the title of the web page, at the top of the window in a typical browser.)

Similarly, the next pair of indices is (7, 9), which indicates the word CS, and the next pair is (15, 20), which indicates the word Color. Numerals such as 201 and punctuation such as : are not counted as words.

1. Implement `Dictionary` Using A Hash Table

As you know, the file dictionarybst.py implements a Dictionary ADT using a BinarySearchTree. Now, in the file search.py you will implement the same Dictionary as a subclass of HashTable. There are really two parts to this task.

First, you must implement methods __setitem__(), __getitem__(), __delitem__(), and __len__(), so that the dictionary is easy to use. They should behave identically to how they behave in dictionarybst.py. This is not difficult.

Second, your new Dictionary should be able to handle three types of keys: int, float, and str. ints are Python numbers that store integer values such as 12, 16348, 0, and -1. floats are Python numbers that store real-number values such as 1.218332 and -7088995.0. strs are Python strings. You can test whether a given key is of type int, float, or str using code like

if type(key) == str:

Your Dictionary class will need to override HashTable's hash() method so that it tests the type of the key and hashes accordingly. You may not just use Python's built-in hash() function; you must devise your own hash function. You may want to try a vew variations and pick the one that performs best. (See the next section.)

Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new Dictionary based on BinarySearchTree that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement Dictionary using a hash table but not require you to handle float or str keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.

2. Demonstrate Your `Dictionary`'s Performance

The efficiency of your hash-table-based dictionary depends somewhat on how good your hash function is. Even if the hash function is uniform, your dictionary will still act slowly whenever the hash table needs to resize. That is, most putting operations should be fast, but occasionally putting should be slow.

Design and run a timing test that clearly demonstrates the performance characteristics of your Dictionary class. In a file searchdiscussion.txt list your test code, show your test data, and explain everything.

3. Build The Basic Search-With-Context System

In this part of the assignment you will use your new Dictionary class to implement search with context, as described above. Here are the basic ingredients.

The code already in main() fetches the web page for you, processes the text in various ways, and obtains a list of pairs of indices. Each of these pairs corresponds to a word in the text. For each word, do the following.

If the word is too common — such as "the" — then throw it out. You have to decide which words are too common. For starters, I found a list of common words on my Mac OS X computer in the file /usr/share/dict/connectives. You may find other word files in /usr/share/dict/ useful. You may also want to check out Python string methods.
If the word is not too common, then add it to your dictionary. The word is the key. The value associated to that key should be a list. A list of what? Of all occurrences of that word in the web page. Exactly how you record the occurrences is up to you.

After you have processed all of the words on the web page, your search database is complete. Then, to search for a particular word, you just ask the dictionary for the list of all occurrences of that word; using those occurrences, print the search word with context.

4. Improve Your System

As you can see below, some of the credit on this assignment is reserved for going beyond the bare requirements to produce an exceptional program. Here are some ideas.

You could invest extra time in coming up with good hash functions. Describe them and explain why you settled on your final choice.
You could implement Dictionary's __str__() method to mimic the built-in Python dictionary (as the Dictionary in dictionarybst.py does).
You could invest extra time in coming up with good criteria for excluding boring words from your search database.
You could improve the formatting of your search result output somehow.
You could add a feature that answers the question: What are the most important words on this web page?
Come up with your own ideas!

If you wish to be considered for these points, then add a note in searchdiscussion.txt explaining what you did that deserves special consideration.

5. Submit Your Work

Submit your searchdiscussion.txt and search.py files electronically by Sunday 11:59 PM. They will be graded based on the following criteria.

Dictionary works and is well-written. (3 points)
Your timing results and discussion are illustrative, informative, and insightful. (3 points)
Your basic search-with-context system works and is well-written. (6 points)
You have demonstrated creativity, polish, and/or unusual diligence, as argued in searchdiscussion.txt. (3 points)