2008 November 15 / jdavis@carleton.edu

Web Crawling

Carleton College CS 201, Prof. Joshua R. Davis

In the preceding assignment we got a glimpse of how a web search engine might process a web page to extract its search terms. In this assignment we look at another aspect of the web search problem: How does the search engine "crawl" the web to get all of those web pages in the first place?

As we have discussed in class, the World Wide Web can be viewed as a large directed graph. The vertices are the web pages (identified by their URLs — that is, addresses), and an edge is a hypertext link from one page to another. A web crawling program starts at some web page, follows that page's links to its neighbors, follows those pages' links to their neighbors, and so on. The search is complicated a little by the fact that the graph is not loaded into memory ahead of time; it must be discovered bit-by-bit as the web crawler moves through the web.

Specifically, in this assignment I'd like you to use a breadth-first or depth-first search of (a part of) the World Wide Web to solve some problem of interest. If you can't think of a problem, I suggest one for you to try.

0. Obtain Files

Here are some files you might need: graph.py, vertex.py, and webcrawler.py. The code comes from our textbook, but I have made some modifications to Graph to make it a little clearer (I hope).

1. Understand The Code Already There

Before writing your own code, you'll want to make sure you understand the code that's already in graph.py, vertex.py, and webcrawler.py. For example, you should ask yourself, "In a Graph of n vertices, what is the efficiency of the methods that add/get vertices/edges?" (You don't have to hand this in; I'm just giving you a study tip. You should always be asking such questions.) In webcrawler.py there are a few elements to ponder too, as follows.

The links() function takes as input a string containing all of the HTML code of a web page. It returns as output a list of strings, namely all of the URLs in links in the web page. A link in HTML is formed by the construct

<a href="http://people.carleton.edu/~jdavis/">this is a link to Josh's web page</a>
The links() function scans the HTML for <a href=", then gobbles up the HTML until it finds the next ". It repeats this until it reaches the end of the HTML. The structure of the code resembles something called a finite state automaton; the same task could be accomplished by a regular expression. You will learn about these concepts in CS 254. I just mention them now to suggest that knowing computer science, including theoretical concepts, can help you be a better computer programmer.
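
To make the scanning idea concrete, here is a stripped-down sketch of that loop (mine, not the exact code in webcrawler.py); it shows the "find the marker, gobble up to the closing quote, repeat" pattern:

def linksSketch(html):
    # Illustrative only; the real links() is in webcrawler.py.
    urls = []
    marker = '<a href="'
    start = html.find(marker)
    while start != -1:
        start = start + len(marker)       # skip past '<a href="'
        end = html.find('"', start)       # find the closing quote
        if end == -1:
            break                         # no closing quote; stop scanning
        urls.append(html[start:end])      # everything between the quotes is the URL
        start = html.find(marker, end)    # look for the next link
    return urls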

The Python library urllib fetches web pages for us. We used it in the preceding assignment. This time we'll use it repeatedly, because we have many web pages to fetch.
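
In case it helps, here is roughly what that fetching looks like, assuming the Python 2 urllib of that era (in Python 3 the same call is urllib.request.urlopen); the fetch() helper and the choice to treat unreachable pages as empty are my own, not part of the provided code:

import urllib

def fetch(url):
    # Download the HTML of a page; treat unreachable pages as if they had no links.
    try:
        return urllib.urlopen(url).read()
    except IOError:
        return ''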

After the links from the web page are extracted with links(), there is still a problem in processing these links. In particular, not all links in a web page are complete in themselves. Here are a few examples. A link such as about.html is relative to the URL of the page it appears on, so it doesn't name a server at all. A link such as #top just points to a spot within the current page. And some links, such as mailto: addresses, don't lead to web pages at all.

There are many more subtleties to HTML than this, but I don't really expect you to learn them. I'm telling you this stuff simply to justify this decision: Let's keep only those URLs that begin with http: and end with .html, discarding the rest. This is what the line
urls = [url for url in urls if len(url) >= 5 and url[:5] == 'http:' and url[-5:] == '.html']
in webcrawler.py does. On many web sites, such as the New York Times, this discards a significant fraction of the URLs but leaves plenty of interesting ones intact. On this assignment, as long as you crawl a part of the web that has plenty of complete links, you shouldn't worry about omitting the incomplete ones. (You may need to try a few web sites to find one that gives you nice links to crawl. Wikipedia, for example, doesn't work, because its article URLs don't end in .html.)
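
For instance, applied to a made-up batch of links (these particular URLs are just illustrations), the filter keeps only the complete .html ones:

urls = ['http://www.nytimes.com/pages/world/index.html', 'about.html',
        '#top', 'http://www.nytimes.com/']
urls = [url for url in urls if len(url) >= 5 and url[:5] == 'http:' and url[-5:] == '.html']
# urls is now ['http://www.nytimes.com/pages/world/index.html']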

2. Specify Your Problem

When I did this assignment yesterday, the problem I set for myself was this: Starting from a specified URL (say http://www.nytimes.com/), how many web pages are within distance 1 of it? How many are within distance 2 of it? How about distance 3? (I stopped at 3, because I found upon running my program that the numbers were getting large.) That is, how does the number of web pages grow, as the allowable distance from the starting page grows?

I decided to write a function countWithinDistance(starturl, distance) that takes in a starting URL and a maximum distance, and returns a list of numbers: the number of web pages within 1 step of the starting URL, the number within 2 steps, the number within 3 steps, and so on, up to the specified maximum distance.

If you like, you may use this as your problem. Alternatively, you can come up with your own problem to solve. Your problem should be at least as interesting as mine (that shouldn't be hard) and at least as difficult as mine. It should be solvable by a breadth-first or depth-first search. A clear, complete description of your problem should appear in a comment near the top of your webcrawler.py.

3. Write Your Web Crawler

Using the elements described above (links(), urllib, discarding incomplete URLs), write a breadth-first or depth-first search that solves your problem. To get you started, I have some suggestions.
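
For instance, a breadth-first version of my countWithinDistance problem might be organized roughly like the sketch below. Treat it as one possible shape rather than the required implementation; it assumes the links() function in webcrawler.py, the URL filter above, and a fetch() helper like the one sketched in part 1.

from collections import deque

def countWithinDistance(starturl, distance):
    # Breadth-first search outward from starturl, recording each page's
    # distance from the start the first time we discover it.
    distances = {starturl: 0}
    queue = deque([starturl])
    while len(queue) > 0:
        url = queue.popleft()
        if distances[url] == distance:
            continue                      # don't expand pages already at the maximum distance
        urls = links(fetch(url))          # fetch() and links() as described above
        urls = [u for u in urls if len(u) >= 5 and u[:5] == 'http:' and u[-5:] == '.html']
        for u in urls:
            if u not in distances:        # first time we've seen this page
                distances[u] = distances[url] + 1
                queue.append(u)
    # counts[d - 1] is the number of pages within d steps (counting the start page itself).
    counts = []
    for d in range(1, distance + 1):
        counts.append(len([u for u in distances if distances[u] <= d]))
    return counts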

Write your code in webcrawler.py. Include sufficient testing code so that the grader can be confident that your program works as intended. You need at least two tests, I think.
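
For what it's worth, the testing code can be as simple as a couple of calls at the bottom of the file; the start URLs and distances below are just examples, and you should use whatever exercises your own problem:

if __name__ == '__main__':
    # Two illustrative tests; pick start pages that have plenty of complete links.
    print countWithinDistance('http://www.nytimes.com/', 2)
    print countWithinDistance('http://people.carleton.edu/~jdavis/', 2)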

In a comment near the top of webcrawler.py, discuss the results of these tests. What have you learned about your problem? For example, in my web crawler I added a line to print out each URL as it was processed. Starting at http://www.nytimes.com/, I found that I could not escape the New York Times (in 3 or fewer steps); there were no links leading outside the site. This taught me something about how much they want to keep me on their site, looking at their advertisements.

4. Submit Your Work Electronically

Submit your webcrawler.py file electronically by Wednesday 5:00 PM. It will be graded based on the following criteria.