2008 November 5 / jjddaavviiss@@ccaarrlleettoonn..eedduu

Search With Context

Carleton College CS 201, Prof. Joshua R. Davis

When you use a web search engine to search for a keyword, you get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:

Marmoset - Wikipedia, the free encyclopedia
Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the Goeldi's Marmoset, ...
en.wikipedia.org/wiki/Marmoset - 44k - Cached - Similar pages
Paul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages

The contextual information for each page helps you decide whether that page will be useful to you.

In this assignment you will practice with hash functions, measure the performance of a dictionary based on a hash table, and then use your dictionary to implement a search-with-context system.

0. Obtain Files

These are some files you might need while working on this assignment.

dicthasbst.py is our implementation of a dictionary using a binary search tree.
hashtable.py is our auto-resizing, chaining HashTable class.
search.py contains some text-processing code already. You'll be adding to this file.

You'll also be editing files dictishash.py and hashdiscussion.txt.

1. Implement `Dictionary` Using Hashing

The file dicthasbst.py implements a Dictionary ADT using BinarySearchTree. Now, in a file dictishash.py, implement this same Dictionary ADT as a subclass of the HashTable class defined in hashtable.py. As always, you should make proper use of HashTable as a superclass. In particular, you should not use HashTable's data directly; you should use only its methods.

Your new Dictionary should be able to handle three types of keys: int, float, and str. ints are Python numbers that store integer values such as 12, 16348, 0, and -1. floats are Python numbers that store real-number values such as 1.218332 and -7088995.0. strs are Python strings. You can test whether a given key is of type int, float, or str using code like

if type(key) == str:

Your Dictionary class will need to override HashTable's hash() method so that it tests the type of the key and hashes accordingly. For each key type, try a couple of hash functions. You may not just use Python's built-in hash() function. In a file hashdiscussion.txt, explain how your hash function works, what other hash functions you have tried, and why you chose the one you did.

Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new Dictionary based on BinarySearchTree that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement Dictionary using a hash table but not require you to handle float or str keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.

2. Demonstrate Your `Dictionary`'s Performance

HashTable is fairly efficient, except on those occasions when a costly resizing operation becomes necessary. The same should be true of your Dictionary class based on HashTable. Design and run a timing test that clearly demonstrates the performance characteristics of your Dictionary class. In the same file hashdiscussion.txt mentioned above, describe/list your test code, describe/show results, and discuss.

3. Build A Search-With-Context System

In this part of the assignment you will use your new Dictionary class to implement search with context, as described above. For example, when my version of the program searches our Color Histograms assignment page for the keyword "wrapper", it produces

... anyway . 2 . Design And Implement A Wrapper Class For CountingTree As a data structure , ... the file colorhistograms . py , create a wrapper class for CountingTree . The rest of your program will be using this wrapper class ( and not the CountingTree class ) ... make it useful and understandable . Since your wrapper class encapsulates all of the functionality of CountingTree ... - commented ( 1 point ) . The wrapper class is correct ( 3 points ) and ...

The formatting is admittedly not great. You can probably imagine some improvements.

For this part of the assignment, edit search.py so that it fetches a web page, builds a search database for it, and prints out the search-with-context results for a given keyword. Here are the basic ingredients, as seen in search.py.

Python's urllib library lets me fetch any web page I want as an HTML file.
The removeHTMLTags() function strips all HTML markup from the web page to produce human-readable, plain text. This function doesn't do much error checking, so it may fail on badly formatted pages, which are abundant on the web. Python comes with a couple of facilities for parsing HTML, which should do the job much better than my removeHTMLTags(), but they were complicated enough that I found it easier to write my own function.
The scan function scans the plain text into tokens. The tokens are words (blocks of continguous letters), numbers (blocks of contiquous digits), and all other symbols (each its own token). Experiment with scan, to see how it works. It doesn't handle everything perfectly; for example, "I've" gets scanned into three tokens.
Some of the tokens are interesting — stuff that a user might want to search for — and others are not. I wrote a function isInteresting(token) that returns True or False based on some criteria:
- I decided that numbers and punctuation are not interesting. That leaves just words.
- I decided that extremely common words such as "the" are not interesting. I found a list of common words in the file /usr/share/dict/connectives (on any Mac).
You need write your own isInteresting() function. You may find other word files in /usr/share/dict/ useful; you may also want to check out Python string methods.
Once I have the list of tokens from the file, I use my Dictionary as a keyword search database. The keys to the dictionary are the interesting tokens; for each key, the associated value is a list of numbers indicating where in the token list that search term occurs. For example, if "marmoset" occurs three times in the file, at token 14, token 51, and token 64, then the dictionary's value for 'marmoset' is [14, 51, 64]. I build this database by processing the token list, testing whether each token is interesting, and updating the dictionary for each interesting token. I did not need to subclass Dictionary to accomplish this.
Now, to do a search-with-context for "wrapper" I simply look up where it occurs in the token list, and then I print out all of the tokens within an 8-token radius of each occurrence of the "wrapper" token. The number 8 is easy to change in my implementation. When there are multiple occurrences, I handle both overlapping and non-overlapping contexts gracefully. When there are no occurrences, I handle that gracefully with a polite message to the user.

There is plenty of room here for going beyond the bare requirements. You could write a really good isInteresting(), improve removeHTMLTags() or scan(), make better search-with-context output than mine, etc. As is explained below, some of the points on this assignment are reserved for this kind of creative effort.

4. Submit Your Work Electronically

Submit your dictishash.py, hashdiscussion.txt, and search.py files electronically by Monday 11:59 PM. They will be graded based on the following criteria.

Dictionary works and subclasses properly (3 points).
There is a good discussion of hash function choice (3 points).
The performance characteristics of Dictionary are clearly demonstrated (3 points).
Your search-with-context system works (3 points) and is clear/well-commented (1 point).
Additional credit (3 points) will be awarded for going beyond the bare requirements. To be considered for these points, please make a note in discussion.txt indicating exactly which aspects of your work deserve consideration, and why. If you do not make such a note, then you will not be considered for the points.