2010 October 4 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u

Assignment: Interpretation

Carleton College CS 201, Prof. Joshua R. Davis

Introduction

In this assignment you will write an interpreter for a simple programming language. Here's an example of my version of the interpreter in action:

Press Control-D to exit.
>> x = 2.718
 2.718
>> ln x
 0.999896315729
>> 13 + 4
 17.0
>> x = 3 * (2 ^ 4)
 48.0
>> x
 48.0
>> jim = sqrt x
 6.92820323028
>> (17 - (jim * 2.43))
 0.164466150431

The interpreter is a glorified calculator, I admit, but it does give a glimpse of how one might create an interpreter for a "real" programming language. (See the epilogue.) The essence of any interpreter is repeating the following steps in a never-ending loop:

Read input from the user.
Scan the input into tokens, according to the word-formation rules of the language.
Parse the list of tokens into a tree, according to the grammar of the language.
Evaluate the parse tree, using the environment to look up the values of any variables.
Print the result for the user, or an error message.

The input and scanning steps are done for you. The parsing and evaluation steps are the crux of this assignment. The final step is easy. You may complete this assignment with one partner. It is due Wednesday at class time.

Acquaint Yourself with `interpreter.py`

Download interpreter.py to your computer. This is the file you'll be editing. You'll also need stack.py, binarytreenode.py, and parser.py from our course web site. Do not use our textbook's parsing code; use our course web site's code.

Examine the code that's already in interpreter.py. In the main section, we begin by defining an environment. The environment exists to match symbols to their meanings during the evaluation step. For example, the + symbol represents addition, which is implemented in Python as the add function in the operator library. You can see this in the environment-making code. You'll add more entries to the environment as you construct your interpreter.

After the environment is defined the program enters its endless loop. The code for obtaining input from the user is already there. The code for scanning the input into a list of tokens is already there; it uses a regular expression, which we have seen in earlier assignments but which lies beyond the scope of this course. (You could certainly write your own scanner without regular expressions; it's just tedious.)

After the scanning step will come your code to parse, evaluate, and print results. As you can tell from the transcript above, there are going to be several features beyond what we've discussed in class. You can add these features in any order you want, but I recommend that you get a basic parser/evaluator working and then incrementally add features to it. The following sections lead you through this process systematically.

Write a Basic Parser

Copy the parse() function from parser.py into interpreter.py. Now you have a basic parser for arithmetic expressions such as (3 + 5). In your main section, you can parse the token list and print out the parse tree. As you construct your interpreter, printing the parse tree can really help you figure out what's going on.

What if the user enters an extra layer of parentheses, as in ((3 + 5))? Mathematically, the extra layer shouldn't matter; the expression should still evaluate to 8. What kind of parse tree results, and why? There's nothing to change here; just be aware of what's happening.

The user should be able to enter 3 + 5 instead of (3 + 5), right? But your parser doesn't handle it, because there are too few parentheses. You could adjust the parser, but it's easier just to wrap the user input in a pair of parentheses before passing it to the parser. Do this. Now a user input of 3 + 5 becomes (3 + 5), which is parsed well. On the other hand, (3 + 5) becomes ((3 + 5)); that's okay too.

Write a Basic Evaluator

Now it's time to write the evaluator. This is the first big part of the assignment. The evaluator is a function, evaluate(), that takes in a parse tree and an environment and returns the numerical value of the parse tree. If some error occurs, then evaluate() prints an error message and returns None.

Recall from class that evaluation can be done recursively. The base case occurs when the tree in question has no children. In this case, the token in the tree node must be either a numeral such as -32.931 or an identifier such as x, jim, or pi. In either case, your job is to extract the numerical value corresponding to the token.

If the token is a numeral, then you can extract its numerical value using Python's float function. If you try this on a token that is not a numeral, such as 2.5josh328, then float raises a ValueError. In Python you handle errors like this:

try:
    value = float(token)
except ValueError:
    # handle the error here...

If the token is an identifier, then you extract its numerical value by looking it up in the environment. Either the identifier is in the environment or it's not. If it's not, then the environment dictionary raises a KeyError error, which you have to handle.

The recursive case of evaluation occurs when the tree has children — in other words, when the root is a branch node. You evaluate the subtrees; then you combine the values of the subtrees using whatever function is indicated by the token in the branch node. For example, if the token is + then you add the values of the subtrees. You could hard-code the idea that "+ means add" into your program, but it's better to let the environment handle this as well. That is, you should look the token up in the environment, the environment should return a Python function, and you should apply that function to the values of the subtrees. Of course, if the branch node token is unknown to the environment, then a KeyError results, and you have to handle that.

In the main section of interpreter.py you can see that the environment already knows the identifiers + and ^. You should add entries for the mathematical constants e and pi. (They are defined in Python's math library.)

Once you have implemented the ideas above, your evaluator is almost done. You just need to print out the evaluated value for your user. Once you've done that, try your interpreter; it should be able to do this much:

>> (3 + pi)
 6.14159265359
>> (e ^ 2.0)
 7.38905609893
>> (x + 1.8)
 error: unknown variable x
>> (3 * 4)
 error: unknown operator *

Incorporate More Functions

In addition to + and ^ you need to handle multiplication *, subtraction -, and division /. The scanner already handles these. Does the parser handle them? If not, adjust it. Does the evaluator handle them? Yes, once you adjust the environment.

You should also handle the five unary functions sqrt, ln, sin, cos, and tan. (Unary means that they take in one argument; a binary function takes in two arguments.) Unary functions require adjustments to the parser, evaluator, and environment. This is the second big part of the assignment.

Incorporate Assignment

Your interpreter must also handle the assignment of values to variables, as in the statement x = 3 + 5. Let's agree that assignment statements all have the form

identifier = expression

where identifier is an identifier (either currently in the environment or not) and expression is any valid expression, like the ones we've been handling up to now. To be clear, there can't be any = in expression; there can be only one = per statement. The assignment statement does two things: It sets the value of the identifier, and it returns that value. You can see this behavior in the first code listing, at the top of this web page.

This is the third and last big part of the assignment. I have two hints to offer.

Your user should be able to enter stuff like x = 3 + 5. You're going to have to add some parentheses to this thing. How, exactly? If you do this well, then your parser can regard = as just another binary operator, like +, so no changes to the parser are necessary.
Although your parser regards = like any binary operator, your evaluator treats = quite differently from +, say. How? This is a very important point! Ordinary functions such as + and sin are processed uniformly: evaluate all of the arguments, look up the function in the environment, and apply the function to the arguments. In contrast, = must be built into the evaluator specifically. It is a keyword of the language rather than a mere function.

Submit Your Work

Test your interpreter thoroughly. It should be able to handle commands as seen in the transcript at the start of this assignment, and similar commands. Although it would be nice, I do not expect you to handle stuff like 3 - 9 + 5; that would require complicated parsing or complicated processing before parsing.

Submit your interpreter.py file electronically as usual. It will be graded based on these criteria:

Basic parsing and evaluation work (3 points).
All required functions (*, sin, etc.) work (3 points).
Variable assignment works (3 points).
The code exhibits Good Python Style (3 points).

Epilogue

This assignment is supposed to give you a glimpse of how a real programming language works. Implementing a programming language is very much like what we've done here — just more of it. Here are a few points to consider.

The syntax in a typical programming language is more complicated than our syntax here. The designer of the language describes the syntax more precisely than we have, so that users know exactly what to expect. She also must write a sophisticated parser to handle that syntax. Somewhat surprisingly, if the syntax is simple enough and described precisely enough, then the parser can be generated automatically from it. This is discussed in some later CS courses such as CS 251 and CS 254.

Our language is all about numbers. In a typical programming language there are other data types, such as strings, and many more built-in functions to operate on those data types.

A typical programming language allows for user-defined functions. This feature may cause a lot of syntax to be added to the language. More importantly, it necessitates a more sophisticated evaluation process. For example, suppose that I want the users of my language to be able to define functions like this:

>> f = (x return (x * x))
 function f defined
>> f(3)
 9.0

The syntax isn't very bad; we just need to adjust the parser to treat return as a binary operator. The parse tree is something like

[=, [f, [], []], [return, [x, [], []], [*, [x, [], []], [x, [], []]]]]

Like =, return is actually a keyword to be handled specially by the evaluator. Neither of its subtrees are evaluated. Instead, the parse subtree

[return, [x, [], []], [*, [x, [], []], [x, [], []]]]

is put into the environment as the value for the identifier f. Let me repeat: The value of f is a parse tree. When the user actually uses the function, as in f(3), the evaluator first evaluates the token 3 to the number 3.0 and f to the parse tree. Then it evaluates that parse tree, but using the value 3.0 for x instead of whatever value might be stored for x in the environment. The local value of x in the function call f(x) eclipses (or shadows) whatever standing value x might have.

Once you have basic user-defined functions working, other issues immediately arise. For example, how can users define functions that take more than one argument? How do you keep track of how many arguments a function takes? How do you incorporate multiple arguments into the evaluation process? If the language supports data types other than numbers, then how do you keep track of the types? This is all kind of tricky. I considered including some of it in this assignment, but then decided that the assignment was difficult enough.