2008 November 5 / jjddaavviiss@@ccaarrlleettoonn..eedduu
Carleton College CS 201, Prof. Joshua R. Davis
When you use a web search engine to search for a keyword, you get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:
Marmoset - Wikipedia, the free encyclopediaThe contextual information for each page helps you decide whether that page will be useful to you.
Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the Goeldi's Marmoset, ...
en.wikipedia.org/wiki/Marmoset - 44k - Cached - Similar pagesPaul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages
In this assignment you will practice with hash functions, measure the performance of a dictionary based on a hash table, and then use your dictionary to implement a search-with-context system.
These are some files you might need while working on this assignment.
HashTable
class.
Dictionary
Using Hashing
The file dicthasbst.py implements a Dictionary
ADT using BinarySearchTree
. Now, in a file dictishash.py, implement this same Dictionary
ADT as a subclass of the HashTable
class defined in hashtable.py. As always, you should make proper use of HashTable
as a superclass. In particular, you should not use HashTable
's data directly; you should use only its methods.
Your new Dictionary
should be able to handle three types of keys: int
, float
, and str
. int
s are Python numbers that store integer values such as 12, 16348, 0, and -1. float
s are Python numbers that store real-number values such as 1.218332 and -7088995.0. str
s are Python strings. You can test whether a given key is of type int
, float
, or str
using code like
if type(key) == str:
Your Dictionary
class will need to override HashTable
's hash()
method so that it tests the type of the key and hashes accordingly. For each key type, try a couple of hash functions. You may not just use Python's built-in hash()
function. In a file hashdiscussion.txt, explain how your hash function works, what other hash functions you have tried, and why you chose the one you did.
Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new Dictionary
based on BinarySearchTree
that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement Dictionary
using a hash table but not require you to handle float
or str
keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.
Dictionary
's Performance
HashTable
is fairly efficient, except on those occasions when a costly resizing operation becomes necessary. The same should be true of your Dictionary
class based on HashTable
. Design and run a timing test that clearly demonstrates the performance characteristics of your Dictionary
class. In the same file hashdiscussion.txt mentioned above, describe/list your test code, describe/show results, and discuss.
In this part of the assignment you will use your new Dictionary
class to implement search with context, as described above. For example, when my version of the program searches our Color Histograms assignment page for the keyword "wrapper", it produces
... anyway . 2 . Design And Implement A Wrapper Class For CountingTree As a data structure , ... the file colorhistograms . py , create a wrapper class for CountingTree . The rest of your program will be using this wrapper class ( and not the CountingTree class ) ... make it useful and understandable . Since your wrapper class encapsulates all of the functionality of CountingTree ... - commented ( 1 point ) . The wrapper class is correct ( 3 points ) and ...The formatting is admittedly not great. You can probably imagine some improvements.
For this part of the assignment, edit search.py so that it fetches a web page, builds a search database for it, and prints out the search-with-context results for a given keyword. Here are the basic ingredients, as seen in search.py.
urllib
library lets me fetch any web page I want as an HTML file.
removeHTMLTags()
function strips all HTML markup from the web page to produce human-readable, plain text. This function doesn't do much error checking, so it may fail on badly formatted pages, which are abundant on the web. Python comes with a couple of facilities for parsing HTML, which should do the job much better than my removeHTMLTags()
, but they were complicated enough that I found it easier to write my own function.
scan
function scans the plain text into tokens. The tokens are words (blocks of continguous letters), numbers (blocks of contiquous digits), and all other symbols (each its own token). Experiment with scan
, to see how it works. It doesn't handle everything perfectly; for example, "I've" gets scanned into three tokens.
isInteresting(token)
that returns True
or False
based on some criteria:
/usr/share/dict/connectives
(on any Mac).
isInteresting()
function. You may find other word files in /usr/share/dict/
useful; you may also want to check out Python string methods.
Dictionary
as a keyword search database. The keys to the dictionary are the interesting tokens; for each key, the associated value is a list of numbers indicating where in the token list that search term occurs. For example, if "marmoset" occurs three times in the file, at token 14, token 51, and token 64, then the dictionary's value for 'marmoset'
is [14, 51, 64]
. I build this database by processing the token list, testing whether each token is interesting, and updating the dictionary for each interesting token. I did not need to subclass Dictionary
to accomplish this.
There is plenty of room here for going beyond the bare requirements. You could write a really good isInteresting()
, improve removeHTMLTags()
or scan()
, make better search-with-context output than mine, etc. As is explained below, some of the points on this assignment are reserved for this kind of creative effort.
Submit your dictishash.py, hashdiscussion.txt, and search.py files electronically by Monday 11:59 PM. They will be graded based on the following criteria.
Dictionary
works and subclasses properly (3 points).
Dictionary
are clearly demonstrated (3 points).