2009 May 12 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u

Search With Context

Carleton College CS 201, Prof. Joshua R. Davis

When you use a web search engine to search for a keyword, you often get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:

Marmoset - Wikipedia, the free encyclopedia
Apr 23, 2009 ... Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the ...
en.wikipedia.org/wiki/Marmoset - 48k - Cached - Similar pages

Paul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages

Here's how this works. Google is continually crawling the web. For each web page it finds, it records all of the interesting words on the page as well as where they occur on the page. Then, when you search for a word, Google finds all of the pages with that word, ranks them by importance/relevance, and for each one shows the search word in context. The context is essentially some fixed number of words or characters surrounding the search word. The context helps you decide which pages will really be useful to you.

To understand all of this requires several skills. The crawling part is related to graph searches, which we're about to study in our course. The importance/relevance part uses linear algebra and probability theory; we'll leave those to math courses. In this assignment, we're focused on the problem of recording interesting words and displaying them in context, for a single web page.

Specifically, you will write a program that fetches a web page, builds a dictionary of search terms for that web page, searches the page for a particular search term, and displays the search result in context. For example, when my version of the program searches our Color Histograms assignment for the word "histogram", it prints this output:

...e program itself) to generate a histogram that shows the spread of colors.... On the right end of the red histogram there is a large spike, which i...omatically, so that the tallest histogram uses the whole vertical space b...
I have the program set to display 32 characters around the search word. Notice that I insert ellipses ("...") between the various occurrences of the search word, to indicate omitted text. Similarly, when I search for "points" I get this output.
...riteria: Enumeration works. (3 points) The histograms work. (6 points) Subclasses are well-written. M...s are not accessed directly. (3 points) All of the code is good Python..., organized, and documented. (3 points) ...
Notice that, because the word "points" occurs repeatedly in quick succession, the 32-character contexts around the occurrences overlap, and my program handles this gracefully. Your output doesn't have to look like mine, but you should take some care to make it look decent.

You may complete this assignment with one partner. It is due Sunday 11:59 PM.

0. Prepare

These are some files you might need.

Take a moment right now to read through search.py. The comments should explain what's going on, even if the code itself is incomprehensible to you (because it relies heavily on regular expressions). In fact, run the program right now. It prints out a plain text version of the web page. It also prints a list of pairs of indices, which you can decipher as follows.

The first pair of indices is (2, 5). It indicates that the first word in the text occurs between characters 2 and 5. More precisely, it indicates that the first word is text[2:5], which is JRD. Verify for yourself that this is the first word in the text. Also go to the web page and verify that this is its first word. (It's in the title of the web page, at the top of the window in a typical browser.)

Similarly, the next pair of indices is (7, 9), which indicates the word CS, and the next pair is (15, 20), which indicates the word Color. Numerals such as 201 and punctuation such as : are not counted as words.

1. Implement Dictionary Using A Hash Table

As you know, the file dictionarybst.py implements a Dictionary ADT using a BinarySearchTree. Now, in the file search.py you will implement the same Dictionary as a subclass of HashTable. There are really two parts to this task.

First, you must implement methods __setitem__(), __getitem__(), __delitem__(), and __len__(), so that the dictionary is easy to use. They should behave identically to how they behave in dictionarybst.py. This is not difficult.

Second, your new Dictionary should be able to handle three types of keys: int, float, and str. ints are Python numbers that store integer values such as 12, 16348, 0, and -1. floats are Python numbers that store real-number values such as 1.218332 and -7088995.0. strs are Python strings. You can test whether a given key is of type int, float, or str using code like

if type(key) == str:
Your Dictionary class will need to override HashTable's hash() method so that it tests the type of the key and hashes accordingly. You may not just use Python's built-in hash() function; you must devise your own hash function. You may want to try a vew variations and pick the one that performs best. (See the next section.)

Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new Dictionary based on BinarySearchTree that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement Dictionary using a hash table but not require you to handle float or str keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.

2. Demonstrate Your Dictionary's Performance

The efficiency of your hash-table-based dictionary depends somewhat on how good your hash function is. Even if the hash function is uniform, your dictionary will still act slowly whenever the hash table needs to resize. That is, most putting operations should be fast, but occasionally putting should be slow.

Design and run a timing test that clearly demonstrates the performance characteristics of your Dictionary class. In a file searchdiscussion.txt list your test code, show your test data, and explain everything.

3. Build The Basic Search-With-Context System

In this part of the assignment you will use your new Dictionary class to implement search with context, as described above. Here are the basic ingredients.

The code already in main() fetches the web page for you, processes the text in various ways, and obtains a list of pairs of indices. Each of these pairs corresponds to a word in the text. For each word, do the following.

After you have processed all of the words on the web page, your search database is complete. Then, to search for a particular word, you just ask the dictionary for the list of all occurrences of that word; using those occurrences, print the search word with context.

4. Improve Your System

As you can see below, some of the credit on this assignment is reserved for going beyond the bare requirements to produce an exceptional program. Here are some ideas.

If you wish to be considered for these points, then add a note in searchdiscussion.txt explaining what you did that deserves special consideration.

5. Submit Your Work

Submit your searchdiscussion.txt and search.py files electronically by Sunday 11:59 PM. They will be graded based on the following criteria.