2009 May 12 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u
Carleton College CS 201, Prof. Joshua R. Davis
When you use a web search engine to search for a keyword, you often get several results, and each result shows you your keyword in context. For example, a Google search for "marmoset" yields results like this:
Marmoset - Wikipedia, the free encyclopediaHere's how this works. Google is continually crawling the web. For each web page it finds, it records all of the interesting words on the page as well as where they occur on the page. Then, when you search for a word, Google finds all of the pages with that word, ranks them by importance/relevance, and for each one shows the search word in context. The context is essentially some fixed number of words or characters surrounding the search word. The context helps you decide which pages will really be useful to you.
Apr 23, 2009 ... Marmosets are New World monkeys of the genus Callithrix, which contains 18 species. The term marmoset is also used in reference to the ...
en.wikipedia.org/wiki/Marmoset - 48k - Cached - Similar pagesPaul's Marmoset Home Page
Welcome to my home page all about marmosets, those wonderful little creatures who become part of your family for many, many years. ...
www.angelfire.com/pa/reclinata/ - 51k - Cached - Similar pages
To understand all of this requires several skills. The crawling part is related to graph searches, which we're about to study in our course. The importance/relevance part uses linear algebra and probability theory; we'll leave those to math courses. In this assignment, we're focused on the problem of recording interesting words and displaying them in context, for a single web page.
Specifically, you will write a program that fetches a web page, builds a dictionary of search terms for that web page, searches the page for a particular search term, and displays the search result in context. For example, when my version of the program searches our Color Histograms assignment for the word "histogram", it prints this output:
...e program itself) to generate a histogram that shows the spread of colors.... On the right end of the red histogram there is a large spike, which i...omatically, so that the tallest histogram uses the whole vertical space b...
I have the program set to display 32 characters around the search word. Notice that I insert ellipses ("...") between the various occurrences of the search word, to indicate omitted text. Similarly, when I search for "points" I get this output.
...riteria: Enumeration works. (3 points) The histograms work. (6 points) Subclasses are well-written. M...s are not accessed directly. (3 points) All of the code is good Python..., organized, and documented. (3 points) ...
Notice that, because the word "points" occurs repeatedly in quick succession, the 32-character contexts around the occurrences overlap, and my program handles this gracefully. Your output doesn't have to look like mine, but you should take some care to make it look decent.
You may complete this assignment with one partner. It is due Sunday 11:59 PM.
These are some files you might need.
dictionarybst.py
is our implementation of a dictionary using a binary search tree. You won't actually use this code, but you may find it helpful in writing your own dictionary class.
hashtable.py
is our auto-resizing, chaining HashTable
class. You'll write a dictionary based on it.
search.py
is the search-with-context program that you'll be editing and handing in.
search.py
. The comments should explain what's going on, even if the code itself is incomprehensible to you (because it relies heavily on regular expressions). In fact, run the program right now. It prints out a plain text version of the web page. It also prints a list of pairs of indices, which you can decipher as follows.
The first pair of indices is (2, 5). It indicates that the first word in the text occurs between characters 2 and 5. More precisely, it indicates that the first word is text[2:5]
, which is JRD
. Verify for yourself that this is the first word in the text. Also go to the web page and verify that this is its first word. (It's in the title of the web page, at the top of the window in a typical browser.)
Similarly, the next pair of indices is (7, 9), which indicates the word CS
, and the next pair is (15, 20), which indicates the word Color
. Numerals such as 201
and punctuation such as :
are not counted as words.
Dictionary
Using A Hash Table
As you know, the file dictionarybst.py
implements a Dictionary
ADT using a BinarySearchTree
. Now, in the file search.py
you will implement the same Dictionary
as a subclass of HashTable
. There are really two parts to this task.
First, you must implement methods __setitem__()
, __getitem__()
, __delitem__()
, and __len__()
, so that the dictionary is easy to use. They should behave identically to how they behave in dictionarybst.py
. This is not difficult.
Second, your new Dictionary
should be able to handle three types of keys: int
, float
, and str
. int
s are Python numbers that store integer values such as 12, 16348, 0, and -1. float
s are Python numbers that store real-number values such as 1.218332 and -7088995.0. str
s are Python strings. You can test whether a given key is of type int
, float
, or str
using code like
if type(key) == str:
Your Dictionary
class will need to override HashTable
's hash()
method so that it tests the type of the key and hashes accordingly. You may not just use Python's built-in hash()
function; you must devise your own hash function. You may want to try a vew variations and pick the one that performs best. (See the next section.)
Remark: In this exercise we are using hashing in two ways. First, we are handling new key types such as strings using a hash function. Second, we are using a hash table for our underlying data structure. These two uses of hashing have little to do with each other. I could ask you to write a new Dictionary
based on BinarySearchTree
that uses a hash function to handle various kinds of keys. (I am not asking you to do that, because it has meager educational value.) Alternatively, I could ask you to implement Dictionary
using a hash table but not require you to handle float
or str
keys. (I am not asking you to do that either.) Make sure that you understand the distinction here.
Dictionary
's PerformanceThe efficiency of your hash-table-based dictionary depends somewhat on how good your hash function is. Even if the hash function is uniform, your dictionary will still act slowly whenever the hash table needs to resize. That is, most putting operations should be fast, but occasionally putting should be slow.
Design and run a timing test that clearly demonstrates the performance characteristics of your Dictionary
class. In a file searchdiscussion.txt
list your test code, show your test data, and explain everything.
In this part of the assignment you will use your new Dictionary
class to implement search with context, as described above. Here are the basic ingredients.
The code already in main()
fetches the web page for you, processes the text in various ways, and obtains a list of pairs of indices. Each of these pairs corresponds to a word in the text. For each word, do the following.
/usr/share/dict/connectives
. You may find other word files in /usr/share/dict/
useful. You may also want to check out Python string methods.
As you can see below, some of the credit on this assignment is reserved for going beyond the bare requirements to produce an exceptional program. Here are some ideas.
Dictionary
's __str__()
method to mimic the built-in Python dictionary (as the Dictionary
in dictionarybst.py
does).
searchdiscussion.txt
explaining what you did that deserves special consideration.
Submit your searchdiscussion.txt
and search.py
files electronically by Sunday 11:59 PM. They will be graded based on the following criteria.
Dictionary
works and is well-written. (3 points)
searchdiscussion.txt
. (3 points)