2008 November 15 / jdavis@carleton.edu

Web Crawling

Carleton College CS 201, Prof. Joshua R. Davis

In the preceding assignment we got a glimpse of how a web search engine might process a web page to extract its search terms. In this assignment we look at another aspect of the web search problem: How does the search engine "crawl" the web to get all of those web pages in the first place?

As we have discussed in class, the World Wide Web can be viewed as a large directed graph. The vertices are the web pages (identified by their URLs — that is, addresses), and an edge is a hypertext link from one page to another. A web crawling program starts at some web page, follows that page's links to its neighbors, follows those pages' links to their neighbors, and so on. The search is complicated a little by the fact that the graph is not loaded into memory ahead of time; it must be discovered bit-by-bit as the web crawler moves through the web.

Specifically, in this assignment I'd like you to use a breadth-first or depth-first search of (a part of) the World Wide Web to solve some problem of interest. If you can't think of a problem, I suggest one for you to try.

0. Obtain Files

Here are some files you might need: graph.py, vertex.py, and webcrawler.py. The code comes from our textbook, but I have made some modifications to Graph to make it a little clearer (I hope).

1. Understand The Code Already There

Before writing your own code, you'll want to make sure you understand the code that's already in graph.py, vertex.py, and webcrawler.py. For example, you should ask yourself, "In a Graph of n vertices, what is the efficiency of the methods that add/get vertices/edges?" (You don't have to hand this in; I'm just giving you a study tip. You should always be asking such questions.) In webcrawler.py there are a few elements to ponder too, as follows.

The links() function takes as input a string containing all of the HTML code of a web page. It returns as output a list of strings, namely all of the URLs in links in the web page. A link in HTML is formed by the construct

<a href="http://people.carleton.edu/~jdavis/">this is a link to Josh's web page</a>
The links() function scans the HTML for <a href=", then gobbles up the HTML until it finds the next ". It repeats this until it reaches the end of the HTML. The structure of the code resembles something called a finite state automaton; the same task could be accomplished by a regular expression. You will learn about these concepts in CS 254. I just mention them now to suggest that knowing computer science, including theoretical concepts, can help you be a better computer programmer.
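
To make the scanning idea concrete, here is a stripped-down sketch of that loop (mine, not the exact code in webcrawler.py); it shows the "find the marker, gobble up to the closing quote, repeat" pattern:

def linksSketch(html):
    # Illustrative only; the real links() is in webcrawler.py.
    urls = []
    marker = '<a href="'
    start = html.find(marker)
    while start != -1:
        start = start + len(marker)       # skip past '<a href="'
        end = html.find('"', start)       # find the closing quote
        if end == -1:
            break                         # no closing quote; stop scanning
        urls.append(html[start:end])      # everything between the quotes is the URL
        start = html.find(marker, end)    # look for the next link
    return urls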

The Python library urllib fetches web pages for us. We used it in the preceding assignment. This time we'll use it repeatedly, because we have many web pages to fetch.
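
In case it helps, here is roughly what that fetching looks like, assuming the Python 2 urllib of that era (in Python 3 the same call is urllib.request.urlopen); the fetch() helper and the choice to treat unreachable pages as empty are my own, not part of the provided code:

import urllib

def fetch(url):
    # Download the HTML of a page; treat unreachable pages as if they had no links.
    try:
        return urllib.urlopen(url).read()
    except IOError:
        return ''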

After the links from the web page are extracted with links(), there is still a problem in processing these links. In particular, not all links in a web page are complete in themselves. Here are a few examples. A link such as about.html is relative to the URL of the page it appears on, so it doesn't name a server at all. A link such as #top just points to a spot within the current page. And some links, such as mailto: addresses, don't lead to web pages at all.

There are many more subtleties to HTML than this, but I don't really expect you to learn them. I'm telling you this stuff simply to justify this decision: Let's keep only those URLs that begin with http: and end with .html, discarding the rest. This is what the line
urls = [url for url in urls if len(url) >= 5 and url[:5] == 'http:' and url[-5:] == '.html']
in webcrawler.py does. On many web sites, such as the New York Times, this discards a significant fraction of the URLs but leaves plenty of interesting ones intact. On this assignment, as long as you crawl a part of the web that has plenty of complete links, you shouldn't worry about omitting the incomplete ones. (You may need to try a few web sites to find one that gives you nice links to crawl. Wikipedia, for example, doesn't work, because its article URLs don't end in .html.)
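
For instance, applied to a made-up batch of links (these particular URLs are just illustrations), the filter keeps only the complete .html ones:

urls = ['http://www.nytimes.com/pages/world/index.html', 'about.html',
        '#top', 'http://www.nytimes.com/']
urls = [url for url in urls if len(url) >= 5 and url[:5] == 'http:' and url[-5:] == '.html']
# urls is now ['http://www.nytimes.com/pages/world/index.html']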

2. Specify Your Problem

When I did this assignment yesterday, the problem I set for myself was this: Starting from a specified URL (say http://www.nytimes.com/), how many web pages are within distance 1 of it? How many are within distance 2 of it? How about distance 3? (I stopped at 3, because I found upon running my program that the numbers were getting large.) That is, how does the number of web pages grow, as the allowable distance from the starting page grows?

I decided to write a function countWithinDistance(starturl, distance) that takes in a starting URL and a maximum distance, and returns a list of numbers: the number of web pages within 1 step of the starting URL, the number within 2 steps, the number within 3 steps, and so on, up to the specified maximum distance.

If you like, you may use this as your problem. Alternatively, you can come up with your own problem to solve. Your problem should be at least as interesting as mine (that shouldn't be hard) and at least as difficult as mine. It should be solvable by a breadth-first or depth-first search. A clear, complete description of your problem should appear in a comment near the top of your webcrawler.py.

3. Write Your Web Crawler

Using the elements described above (links(), urllib, discarding incomplete URLs), write a breadth-first or depth-first search that solves your problem. To get you started, I have some suggestions.
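
For instance, a breadth-first version of my countWithinDistance problem might be organized roughly like the sketch below. Treat it as one possible shape rather than the required implementation; it assumes the links() function in webcrawler.py, the URL filter above, and a fetch() helper like the one sketched in part 1.

from collections import deque

def countWithinDistance(starturl, distance):
    # Breadth-first search outward from starturl, recording each page's
    # distance from the start the first time we discover it.
    distances = {starturl: 0}
    queue = deque([starturl])
    while len(queue) > 0:
        url = queue.popleft()
        if distances[url] == distance:
            continue                      # don't expand pages already at the maximum distance
        urls = links(fetch(url))          # fetch() and links() as described above
        urls = [u for u in urls if len(u) >= 5 and u[:5] == 'http:' and u[-5:] == '.html']
        for u in urls:
            if u not in distances:        # first time we've seen this page
                distances[u] = distances[url] + 1
                queue.append(u)
    # counts[d - 1] is the number of pages within d steps (counting the start page itself).
    counts = []
    for d in range(1, distance + 1):
        counts.append(len([u for u in distances if distances[u] <= d]))
    return counts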

Write your code in webcrawler.py. Include sufficient testing code so that the grader can be confident that your program works as intended. You need at least two tests, I think.
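
For what it's worth, the testing code can be as simple as a couple of calls at the bottom of the file; the start URLs and distances below are just examples, and you should use whatever exercises your own problem:

if __name__ == '__main__':
    # Two illustrative tests; pick start pages that have plenty of complete links.
    print countWithinDistance('http://www.nytimes.com/', 2)
    print countWithinDistance('http://people.carleton.edu/~jdavis/', 2)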

In a comment near the top of webcrawler.py, discuss the results of these tests. What have you learned about your problem? For example, in my web crawler I added a line to print out each URL as it was processed. Starting at http://www.nytimes.com/, I found that I could not escape the New York Times (in 3 or fewer steps); there were no links leading outside the site. This taught me something about how much they want to keep me on their site, looking at their advertisements.

4. Submit Your Work Electronically

Submit your webcrawler.py file electronically by Wednesday 5:00 PM. It will be graded based on the following criteria.