2008 November 15 / jdavis@carleton.edu
Carleton College CS 201, Prof. Joshua R. Davis
In the preceding assignment we got a glimpse of how a web search engine might process a web page to extract its search terms. In this assignment we look at another aspect of the web search problem: How does the search engine "crawl" the web to get all of those web pages in the first place?
As we have discussed in class, the World Wide Web can be viewed as a large directed graph. The vertices are the web pages (identified by their URLs — that is, addresses), and an edge is a hypertext link from one page to another. A web crawling program starts at some web page, follows that page's links to its neighbors, follows those pages' links to their neighbors, and so on. The search is complicated a little by the fact that the graph is not loaded into memory ahead of time; it must be discovered bit-by-bit as the web crawler moves through the web.
Specifically, in this assignment I'd like you to use a breadth-first or depth-first search of (a part of) the World Wide Web to solve some problem of interest. If you can't think of a problem, I suggest one below for you to try.
Here are some files you might need. The code comes from our textbook, but I have made some modifications to Graph to make it a little clearer (I hope).
graph.py: the Graph class; it imports vertex.py.
vertex.py: the Vertex class used by Graph.
webcrawler.py: starter code for this assignment; it contains the links() function described below.
Before writing your own code, you'll want to make sure you understand the code that's already in graph.py, vertex.py, and webcrawler.py. For example, you should ask yourself, "In a Graph of n vertices, what is the efficiency of the methods that add/get vertices/edges?" (You don't have to hand this in; I'm just giving you a study tip. You should always be asking such questions.) In webcrawler.py there are a few elements to ponder too, as follows.
The links() function takes as input a string containing all of the HTML code of a web page. It returns as output a list of strings, namely all of the URLs in links in the web page. A link in HTML is formed by the construct

<a href="http://people.carleton.edu/~jdavis/">this is a link to Josh's web page</a>

The links() function scans the HTML for <a href=", then gobbles up the HTML until it finds the next ". It repeats this until it reaches the end of the HTML. The structure of the code resembles something called a finite state automaton; the same task could be accomplished by a regular expression. You will learn about these concepts in CS 254. I just mention them now to suggest that knowing computer science, including theoretical concepts, can help you be a better computer programmer.
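To make the scanning idea concrete, here is a minimal sketch of a links()-style function; the version provided in webcrawler.py may differ in its details, so treat this only as an illustration of the scanning loop.

# A sketch of the scanning idea behind links(); the provided version in
# webcrawler.py may be written differently.
def links(html):
    urls = []
    position = 0
    while True:
        start = html.find('<a href="', position)
        if start == -1:
            break                                  # no more links; we are done
        start = start + len('<a href="')
        end = html.find('"', start)                # gobble up to the closing quote
        if end == -1:
            break                                  # malformed HTML; stop scanning
        urls.append(html[start:end])
        position = end + 1
    return urls

# For example, this prints ['http://people.carleton.edu/~jdavis/'].
print(links('<a href="http://people.carleton.edu/~jdavis/">this is a link</a>'))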
The Python library urllib fetches web pages for us. We used it in the preceding assignment. This time we'll use it repeatedly, because we have many web pages to fetch.
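As a reminder, fetching a page's HTML as one string looks roughly like the sketch below. I'm assuming the Python 2 urllib we used in the preceding assignment (in Python 3 the same call is urllib.request.urlopen), and the helper name fetchPage is mine, not part of the provided code.

import urllib

# A sketch: fetch one page and return all of its HTML as a single string.
def fetchPage(url):
    connection = urllib.urlopen(url)               # open the page over the network
    html = connection.read()                       # read its HTML as one string
    connection.close()
    return html

html = fetchPage('http://people.carleton.edu/~jdavis/')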
After the links from the web page are extracted with links(), there is still a problem in processing these links. In particular, not all links in a web page are complete in themselves. Here are a few examples.
For example, my web page appears to be at http://people.carleton.edu/~jdavis/, but it is really at http://people.carleton.edu/~jdavis/index.html. When asked for a URL indicating a directory (that is, ending in a slash), Carleton's web server automatically serves up the file index.html in that directory.
Other links are relative, such as 2008f201/. As it is typed, this URL doesn't work, because it is missing its beginning. We add on the beginning by realizing/deciding that the new URL must be in the same directory as the old one. The old directory was http://people.carleton.edu/~jdavis/, so that is prepended to 2008f201/ to form http://people.carleton.edu/~jdavis/2008f201/.
To sidestep these complications, we keep only the URLs that begin with http: and end in .html. This is what the line

urls = [url for url in urls if len(url) >= 5 and url[:5] == 'http:' and url[-5:] == '.html']

in webcrawler.py does. On many web sites, such as the New York Times, this discards a significant fraction of the URLs but leaves plenty of interesting ones intact. On this assignment, as long as you crawl a part of the web that has plenty of complete links, you shouldn't worry about omitting the incomplete ones. (You may need to try a few web sites, to find one that gives you nice links to crawl. For example, Wikipedia doesn't work.)
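You don't need to complete relative links for this assignment, since we simply discard them, but if you're curious, the prepending trick described above might look like this sketch; the helper name completeURL is mine, not part of the provided code.

# A sketch of completing a relative link by prepending the directory of the
# page it appeared on. Not part of the provided code, and not required here.
def completeURL(pageurl, link):
    if len(link) >= 5 and link[:5] == 'http:':
        return link                                # already a complete URL
    directory = pageurl[:pageurl.rfind('/') + 1]   # everything up to the last slash
    return directory + link

# For example, this prints http://people.carleton.edu/~jdavis/2008f201/.
print(completeURL('http://people.carleton.edu/~jdavis/index.html', '2008f201/'))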
When I did this assignment yesterday, the problem I set for myself was this: Starting from a specified URL (say http://www.nytimes.com/), how many web pages are within distance 1 of it? How many are within distance 2 of it? How about distance 3? (I stopped at 3, because I found upon running my program that the numbers were getting large.) That is, how does the number of web pages grow, as the allowable distance from the starting page grows?
I decided to write a function countWithinDistance(starturl, distance) that takes in a starting URL and a maximum distance, and returns a list of numbers: the number of web pages within 1 step of the starting URL, the number within 2 steps, the number within 3 steps, and so on, up to the specified maximum distance.
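In other words, a call would look like the following; the numbers you get back depend entirely on the site you crawl.

# counts[0] is the number of pages within 1 step of the start page,
# counts[1] the number within 2 steps, and counts[2] the number within 3.
counts = countWithinDistance('http://www.nytimes.com/', 3)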
If you like, you may use this as your problem. Alternatively, you can come up with your own problem to solve. Your problem should be at least as interesting as mine (that shouldn't be hard) and at least as difficult as mine. It should be solvable by a breadth-first or depth-first search. A clear, complete description of your problem should appear in a comment near the top of your webcrawler.py.
Using the elements described above (links(), urllib, discarding incomplete URLs), write a breadth-first or depth-first search that solves your problem. To get you started, I have some suggestions.
Begin with an empty Graph and an empty Queue or Stack. You prime these with the starting vertex. After priming, you loop until the queue/stack is empty. I recommend that you use URLs as keys to your graph; the values are Vertex objects.
When you process a vertex, its neighbors (the URLs linked from its web page) are probably not in your Graph yet; you need to add them, and add edges from the current vertex to them. Then you can proceed with the usual breadth-first or depth-first search process (process each white neighbor).
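To show how these pieces might fit together, here is a rough sketch of a breadth-first crawl for my counting problem. It is only a sketch: to keep it self-contained it uses a plain dictionary and collections.deque in place of the provided Graph, Vertex, and Queue classes, a small regular expression in place of links(), and Python 2's urllib; your solution should use the provided classes and functions instead.

import re
import urllib
from collections import deque

def countWithinDistance(starturl, distance):
    distances = {starturl: 0}                      # URL -> number of steps from starturl
    queue = deque([starturl])
    while len(queue) > 0:
        url = queue.popleft()
        if distances[url] >= distance:
            continue                               # don't crawl beyond the maximum distance
        try:
            html = urllib.urlopen(url).read()      # Python 2; urllib.request in Python 3
        except IOError:
            continue                               # skip pages that fail to load
        urls = re.findall(r'<a href="([^"]*)"', html)
        # Keep only complete URLs, as discussed above.
        urls = [u for u in urls if len(u) >= 5 and u[:5] == 'http:' and u[-5:] == '.html']
        for neighbor in urls:
            if neighbor not in distances:          # a white (undiscovered) page
                distances[neighbor] = distances[url] + 1
                queue.append(neighbor)
    # Count the pages within 1, 2, ..., distance steps, not counting the start page.
    counts = []
    for d in range(1, distance + 1):
        counts.append(len([u for u in distances if 1 <= distances[u] <= d]))
    return counts

print(countWithinDistance('http://www.nytimes.com/', 1))

Recording each URL's distance as it is discovered is what makes breadth-first search a natural fit here; a depth-first search would not visit pages in order of their distance from the start.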
In a comment near the top of webcrawler.py, discuss the results of your tests. What have you learned about your problem? For example, in my web crawler I added a line to print out each URL as it was processed. Starting at http://www.nytimes.com/, I found that I could not escape the New York Times (in 3 or fewer steps); there were no links leading outside the site. This taught me something about how much they want to keep me on their site, looking at their advertisements.
Submit your webcrawler.py file electronically by Wednesday 5:00 PM. It will be graded based on the following criteria.