2010 September 18 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u

Assignment: XML

Carleton College CS 201, Prof. Joshua R. Davis

Introduction

XML is a language for describing data in a particular structured way. The structure of an XML file is always "tree-like" or "stacky". These are vague terms, but by the end of the course you'll understand exactly what they mean. Nowadays many computer programs use XML-based file formats. One example is iTunes, an audio-playing program available on Mac OS X and Windows.

In this assignment you will use stacks to check whether an iTunes XML document is syntactically well-formed. This is essentially a fancy version of the parentheses-matching problem described in our textbook. In addition to our class' stack.py file, you'll need these files:

You must complete and hand in this assignment individually, not with any partner, although as always you are encouraged to bounce ideas off other students. The assignment is due Friday 11:59 PM.

Acquaint Yourself with playlist.xml

If you view playlist.xml in the Safari web browser, then you'll see a bunch of text crammed together; it's not very enlightening. If you view playlist.xml in the Firefox web browser, then you'll see a hierarchical, "tree-like" view of its contents; Firefox understands the structure of XML files. (So does Safari, but it refuses to show you the structure.) If you view endcut.xml, startcut.xml, and mismatch.xml in Firefox, then the program complains that they are not well-formed XML files; that's what this assignment is all about.

To understand what's going on, examine the playlist.xml file in TextWrangler. As you can see, an XML document is just a bunch of text, but sprinkled with nuggets enclosed in < and >, such as <dict>; these are called tags.

The first two lines in playlist.xml are special. The first line has an ?xml tag, which indicates that this is an XML file. This tag has two attributes, naming the intended XML version and text encoding. The second line has a !DOCTYPE tag, which has an attribute describing what kind of XML document we're looking at. In this assignment we're going to ignore these two special lines.

After the first two lines, the file takes on a rigid structure. There are three kinds of tags. What distinguishes them is whether they have a slash / in them, and where.

After the first two lines, the structure of playlist.xml is "tree-like". It is a nest of opening and closing tags. The whole file (after the first two lines) is contained in a single <plist> </plist> pair. Inside this plist pair is a single <dict> </dict> pair. Inside that dict pair are many keys and other things.

The important point here is that all tags (after the first two lines) are properly nested. If a tag is opened inside a matching pair of opening/closing tags, then it is closed inside that same matching pair. Put another way, each closing tag matches the most-recently-opened-but-not-yet-closed tag. We say that the XML file is well-formed if the tags nest properly like this. In this assignment you will write a program to check whether a given XML file is well-formed.

To check well-formedness, you don't need to know what the tags actually mean; you just need to check that they are correctly nested. Nevertheless, the tags here should be saying something to you. Notice that a dict contains keys, with each key followed by exactly one other object, before the next key. Where have you seen this sort of thing before?

Acquaint Yourself with the Code

The file xml.py already contains some code. Read through it. The demo section begins with a line for specifying which XML file to check. Then there are a few lines that read the contents of the file into a string. Then there is a strange line

tags = re.findall(r'<(/?\w+/?)[^>]*>', fileString)
containing some crazy gibberish called a regular expression. A regular expression is a formula — not a mathematical formula, but a linguistic formula — for describing a certain kind of text. The regular expression you see here is carefully designed to describe XML tags. The Python function re.findall extracts all strings that match the regular expression — that is, all XML tags — in an extremely simple and fast way.

(In this course you are not expected to learn regular expressions. CS 254 covers them and many other topics. Also, for the sake of completeness, I must admit that the regular expression above describes slightly more than XML tags. But it is sufficiently precise for our purposes.)

The demo section ends by printing out all of the tags. I recommend that you run this code once as it is, to see what the tags look like.

Write Your Well-Formedness Checker

Your assignment is to replace the line that prints out the tags with a bunch of code that uses a stack to check the well-formedness of the XML. The key ideas are covered in our textbook, in the section on checking nesting of parenthesized expressions. Essentially, each XML tag is a different kind of parenthesis, with opening (e.g. dict) and closing (e.g. /dict) variants. Remember that an empty tag (e.g. true/) opens and immediately closes itself; how do you treat this? (Although <true/> is the only empty tag in playlist.xml, your program should be able to handle any empty tag.)

If the XML is well-formed, then your program should print out a happy message to that effect. If the XML is not well-formed, then there are several different ways in which it could have failed. You should print out a detailed error message to tell the user what went wrong. Part of your error message should be the contents of the stack at the time the error occurred. This helps the user figure out where the bad part of the XML file is. For example, here's what my program does when I hand it a well-formed XML file:

jdavis$ python xml.py
The file playlist.xml correctly nests XML tags.
Here's what my program does with an XML file that closes tags incorrectly:
jdavis$ python xml.py
The file mismatch.xml does not correctly nest XML tags.
The tag /sminteger was found where the following tags were expected 
to be closed (starting with the inner-most tag):
integer
dict
dict
dict
plist
I have provided you with three damaged iTunes XML files, in addition to the undamaged playlist.xml. Make sure that your program works correctly on all of them. You may want to test your program on other iTunes XML files as well; I'm sure you can find someone who uses iTunes, to get more samples.

Submit Your Work

Submit your xml.py file electronically, as usual. Don't forget Good Python Style. The grader will test your program on several XML files — not necessarily the ones given above. Here are the grading criteria.