2008 November 5 / jjddaavviiss@@ccaarrlleettoonn..eedduu
Carleton College CS 201, Prof. Joshua R. Davis
When you use a web search engine to search for a keyword, you get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:
Marmoset - Wikipedia, the free encyclopediaThe contextual information for each page helps you decide whether that page will be useful to you.
Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the Goeldi's Marmoset, ...
en.wikipedia.org/wiki/Marmoset - 44k - Cached - Similar pages
Paul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages
In this assignment you will practice with hash functions, measure the performance of a dictionary based on a hash table, and then use your dictionary to implement a search-with-context system.
These are some files you might need while working on this assignment.
The file dicthasbst.py implements a
Dictionary ADT using
BinarySearchTree. Now, in a file dictishash.py, implement this same
Dictionary ADT as a subclass of the
HashTable class defined in hashtable.py. As always, you should make proper use of
HashTable as a superclass. In particular, you should not use
HashTable's data directly; you should use only its methods.
Dictionary should be able to handle three types of keys:
ints are Python numbers that store integer values such as 12, 16348, 0, and -1.
floats are Python numbers that store real-number values such as 1.218332 and -7088995.0.
strs are Python strings. You can test whether a given key is of type
str using code like
if type(key) == str:
Dictionary class will need to override
hash() method so that it tests the type of the key and hashes accordingly. For each key type, try a couple of hash functions. You may not just use Python's built-in
hash() function. In a file hashdiscussion.txt, explain how your hash function works, what other hash functions you have tried, and why you chose the one you did.
Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new
Dictionary based on
BinarySearchTree that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement
Dictionary using a hash table but not require you to handle
str keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.
HashTable is fairly efficient, except on those occasions when a costly resizing operation becomes necessary. The same should be true of your
Dictionary class based on
HashTable. Design and run a timing test that clearly demonstrates the performance characteristics of your
Dictionary class. In the same file hashdiscussion.txt mentioned above, describe/list your test code, describe/show results, and discuss.
In this part of the assignment you will use your new
Dictionary class to implement search with context, as described above. For example, when my version of the program searches our Color Histograms assignment page for the keyword "wrapper", it produces
... anyway . 2 . Design And Implement A Wrapper Class For CountingTree As a data structure , ... the file colorhistograms . py , create a wrapper class for CountingTree . The rest of your program will be using this wrapper class ( and not the CountingTree class ) ... make it useful and understandable . Since your wrapper class encapsulates all of the functionality of CountingTree ... - commented ( 1 point ) . The wrapper class is correct ( 3 points ) and ...The formatting is admittedly not great. You can probably imagine some improvements.
For this part of the assignment, edit search.py so that it fetches a web page, builds a search database for it, and prints out the search-with-context results for a given keyword. Here are the basic ingredients, as seen in search.py.
urlliblibrary lets me fetch any web page I want as an HTML file.
removeHTMLTags()function strips all HTML markup from the web page to produce human-readable, plain text. This function doesn't do much error checking, so it may fail on badly formatted pages, which are abundant on the web. Python comes with a couple of facilities for parsing HTML, which should do the job much better than my
removeHTMLTags(), but they were complicated enough that I found it easier to write my own function.
scanfunction scans the plain text into tokens. The tokens are words (blocks of continguous letters), numbers (blocks of contiquous digits), and all other symbols (each its own token). Experiment with
scan, to see how it works. It doesn't handle everything perfectly; for example, "I've" gets scanned into three tokens.
Falsebased on some criteria:
/usr/share/dict/connectives(on any Mac).
isInteresting()function. You may find other word files in
/usr/share/dict/useful; you may also want to check out Python string methods.
Dictionaryas a keyword search database. The keys to the dictionary are the interesting tokens; for each key, the associated value is a list of numbers indicating where in the token list that search term occurs. For example, if "marmoset" occurs three times in the file, at token 14, token 51, and token 64, then the dictionary's value for
[14, 51, 64]. I build this database by processing the token list, testing whether each token is interesting, and updating the dictionary for each interesting token. I did not need to subclass
Dictionaryto accomplish this.
There is plenty of room here for going beyond the bare requirements. You could write a really good
scan(), make better search-with-context output than mine, etc. As is explained below, some of the points on this assignment are reserved for this kind of creative effort.
Submit your dictishash.py, hashdiscussion.txt, and search.py files electronically by Monday 11:59 PM. They will be graded based on the following criteria.
Dictionaryworks and subclasses properly (3 points).
Dictionaryare clearly demonstrated (3 points).