2009 April 24 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u
Carleton College CS 201, Prof. Joshua R. Davis
In this assignment you will write an interpreter for a simple programming language. Here's an example of my version of the interpreter in action:
The interpreter is a glorified calculator, I admit, but it does give a glimpse of how one might create an interpreter for a "real" programming language. (See the epilogue.) The essence of any interpreter is repeating the following steps in a never-ending loop:Press Control-D to exit. >> x = 2.718 2.718 >> ln x 0.999896315729 >> 13 + 4 17.0 >> x = 3 * (2 ^ 4) 48.0 >> x 48.0 >> jim = sqrt x 6.92820323028 >> (17 - (jim * 2.43)) 0.164466150431
You may complete this assignment with one partner. It is due Friday May 1 at class time (1:10 PM).
You'll need this stuff.
interpreter.py
is the interpreter file you'll be editing.
parser.py
is the parser code from class. It's similar to the book's, but it doesn't mix parsing with scanning and evaluating, as the book's does.
stack.py
is our usual Stack
data structure. You need it for parsing.
binarytree.py
is our usual BinaryTree
, also needed for parsing.
interpreter.py
Examine the code that's already in interpreter.py
. In main()
, we begin by defining an environment. The environment exists to match symbols to their meanings during the evaluation step. For example, the +
symbol represents addition, which is implemented in Python as the add
function in the operator
library. You can see this in the environment-making code. You'll add more entries to the environment as you construct your interpreter.
After the environment is defined the program enters its endless loop. The code for obtaining input from the user is already there. The code for scanning the input into a list of tokens is already there; it uses a regular expression, which we have seen in earlier assignments but which lies beyond the scope of this course. (You could certainly write your own scanner without regular expressions; it's just tedious.)
After the scanning step will come your code to parse, evaluate, and print results. As you can tell from the transcript above, there are going to be several features beyond what we've discussed in class. You can add these features in any order you want, but I recommend that you get a basic parser/evaluator working and then incrementally add features to it. The following sections lead you through this process systematically.
Simply copy the parse()
function from parser.py
into interpreter.py
. This is a basic parser for arithmetic expressions. In your main()
function, you can now parse the token list and print out the parse tree. As you construct your interpreter, printing the parse tree can really help you figure out what's going on.
Now it's time to write the evaluator. This is the first big part of the assignment. The evaluator is a function, evaluate()
, that takes in a parse tree and an environment and returns the numerical value of the parse tree. If some error occurs, then evaluate()
prints an error message and returns None
.
Recall from class that evaluation can be done recursively. The base case occurs when the tree in question has no children. In this case, the root token in the tree must be either a numeral or an identifier such as x
, jim
, or pi
. When you encounter an identifier, you look up its value in the environment. Either the identifier is in the environment or it's not. If it's not, then the environment dictionary raises a KeyError
error. In Python you handle errors like this:
When you encounter a numeral, you can convert it to a number using Python'stry: value = environment[token] except KeyError: # handle the error here...
float()
function. If the string in question is not a valid numeral, such as 2.5josh328
, then float()
raises a ValueError
, which you handle similarly to how you handled the KeyError
above.
The recursive case of evaluation occurs when the tree has children — in other words, when the root is a branch node. You evaluate the subtrees and combine the results using whatever function is indicated by the root. For example, if the root token is +
then you return the sum of the values of the children. You could hard-code the idea of "+
means add" into your program, but it's better to let the environment handle this as well. That is, you should look the root token up in the environment, the environment should return a Python function, and you should apply that function to the values of the subtrees. Of course, if the root token is unknown to the environment, then a KeyError
results, and you have to handle that.
In main()
you can see that the environment already knows the identifiers +
and ^
. You should add entries for the mathematical constants e
and pi
. (They are defined in Python's math
library.)
Once your evaluator is working, you just need to print its result for the user. At this point your interpreter should be able to do this much:
>> (3 + pi) 6.14159265359 >> (e ^ 2.0) 7.38905609893 >> (x + 1.8) error: unknown variable x >> (3 * 4) error: unknown operator *
What if the user enters an extra layer of parentheses, as in ((3 + 5))
? The extra layer shouldn't matter; the expression should still evaluate to 8
. So how do you handle this? I recommend that you do not alter your parser. Instead, study what kind of parse tree results, and handle it in your evaluator.
Also, the user should be able to enter 3 + 5
instead of (3 + 5)
, right? But your parser doesn't handle it, because there are too few parentheses. You could adjust the parser, but it's easier just to wrap the user input in a pair of parentheses before passing it to the parser. If the user has already wrapped the input in parentheses, then the parentheses you slap on are redundant, but that's no problem, right?
In addition to +
and ^
you need to handle multiplication *
, subtraction -
, and division /
. The scanner already handles these. Does the parser handle them? If not, adjust it. Does the evaluator handle them? Yes, once you adjust the environment.
You should also handle the five unary functions sqrt
, ln
, sin
, cos
, and tan
. (Unary means that they take in one argument; a binary function takes in two arguments.) Unary functions require adjustments to the parser, evaluator, and environment. This is a big part of the assignment.
Your interpreter must also handle the assignment of values to variables, as in the command x = 3 + 5
. This is the last big part of the assignment. I have a few recommendations.
Let's make it a rule of our language that there can be at most one =
in a command, and it must occur as the second token, with the first token being an identifier (either in the environment or not). The assignment command sets the value of the identifier and returns that value.
So that your user can enter x = 3 + 5
instead of x = (3 + 5)
, wrap everything after the =
in an extra layer of parentheses. Then your code from earlier should wrap the entire thing in parentheses to make (x = (3 + 5))
before passing it to the parser. Given this code, your parser can regard =
as just another binary operator, like +
, and everything will parse well.
Although your parser regards =
like any binary operator, your evaluator treats =
quite differently from +
, say. How?
This is a very important point! Ordinary functions such as +
and sin
are processed uniformly: evaluate all of the arguments, look up the function in the environment, and apply the function to the arguments. In contrast, =
must be built into the evaluator specifically. It is a keyword of the language rather than a mere function.
Test your interpreter thoroughly. It should be able to handle commands as seen in the transcript at the start of this assignment. Although it would be nice, I do not expect you to handle stuff like 3 - 9 + 5
; that would require complicated parsing or complicated processing before parsing.
Make sure your interpreter.py
file is good Python, as described in the preceding assignment: documentation strings, a proper main()
function, appropriate comments, clear code, etc.
Submit your interpreter.py
file electronically (either using hsp
or by putting it on COURSES manually). It will be graded based on these criteria:
*
, sin
, etc.) work. (3 points)
This assignment is supposed to give you a glimpse of how a real programming language works. Implementing a programming language is very much like what we've done here — just more of it. Here are a few points to consider.
The syntax in a typical programming language is more complicated than our syntax here. The designer of the language describes the syntax more precisely than we have, so that users know exactly what to expect. She also must write a sophisticated parser to handle that syntax. Somewhat surprisingly, if the syntax is simple enough and described precisely enough, then the parser can be generated automatically from it. This is discussed in some later CS courses such as CS 251 and CS 254.
Our language is all about numbers. In a typical programming language there are other data types, such as strings, and many more built-in functions to operate on those data types.
A typical programming language allows for user-defined functions. This feature may cause a lot of syntax to be added to the language. More importantly, it necessitates a more sophisticated evaluation process. For example, suppose that I want the users of my language to be able to define functions like this:
The syntax isn't very bad; we just need to adjust the parser to treat>> f = (x return (x * x)) function f defined >> f(3) 9.0
return
as a binary operator. The parse tree is something like
Like[=, [f, [], []], [return, [x, [], []], [*, [x, [], []], [x, [], []]]]]
=
, return
is actually a keyword to be handled specially by the evaluator. Neither of its subtrees are evaluated. Instead, the whole parse tree rooted at return
is put into the environment as the value for the identifier f
. Let me repeat: The value of f
is a parse tree. When the user actually uses the function, as in f(3)
, the evaluator first evaluates the token 3
to the number 3.0 and f
to the parse tree. Then it evaluates the right subtree of that parse tree, but using the value 3.0 for x
instead of whatever value might be stored for x
in the environment. The local value of x
in the function call f(x)
eclipses (or shadows) whatever standing value x
might have.
Once you have basic user-defined functions working, other issues immediately arise. For example, how can users define functions that take more than one argument? How do you keep track of how many arguments a function takes? How do you incorporate multiple arguments into the evaluation process? If the language supports data types other than numbers, then how do you keep track of the types? This is all kind of tricky. I considered including some of it in this assignment, but then decided that the assignment was difficult enough.