2014 May 7,

CS 254: Day 15

Carleton College, Prof. Joshua R. Davis,

Introduction

In this assignment, we're going to implement a simple programming language together. Interpreting a statement in a programming language is done in three steps, which coincidentally (?) correspond to the three kinds of machines in this course:

The statement is given to the interpreter as a string. The interpreter scans the string into a list of tokens. Intuitively, each token is a word or punctuation mark, as defined by the language. This task is not complicated; it can be performed by a finite automaton.
The list of strings is parsed into a tree that reflects its grammatical structure, as defined by the syntax of the language. In particular, subexpressions correspond to subtrees. This task typically involves processing a context-free grammar, using a stack or recursion. So it's roughly at the level of a push-down automaton.
The tree is evaluated. Intuitively, this means that the work described by the tree is carried out. A value is computed and returned to the user or stored in a variable. This task may involve arbitrarily complicated computation and memory access. It requires a Turing machine.

You may have implemented a programming language in a previous course, such as CS 251. In our course, the purpose of the exercise is to emphasize the relationships among finite automata, push-down automata, and Turing machines. I've already written the evaluator and the user interface; see interpreter.py. You're going to write the scanner and parser. If you write your code to specification, then it will work seamlessly with my code.

The Language

Our language has only three data types. The first type is floating-point numbers. Numbers can be added, multiplied, and stored in variables, as the following transcript shows. (By the way, if you know Lisp or one of its offspring, then this language may look familiar, but it is definitely not Lisp.)

Desktop > python interpreter.py 
Welcome to So Super Language. Press Control-D to exit.
:> 7
7.0
:> (+ 2 -4.1)
-2.1
:> (set age 20)
20.0
:> (* age 2)
40.0
:> (* 15 40) (+ 100 7) age
600.0
107.0
20.0
:> ^D
Desktop >

The second type is strings. These are delimited by the single quotation mark ', and they cannot contain any 's. Strings can be concatenated, repeated, and stored in variables:

:> (+ 'computability' ' and complexity')
'computability and complexity'
:> (set name 'Turing')
'Turing'
:> (* 4 name)
'TuringTuringTuringTuring'

The third type is functions. These can be defined by the user, applied to arguments, and stored in variables. Functions are first-class: They can be passed to other functions as arguments, and returned from functions as output.

:> (set mult3 (fun (x) (* 3 x)))
(fun (x) (* x 3))
:> (mult3 20)
60.0
:> (mult3 'CS254')
'CS254CS254CS254'
:> ((fun (x) (+ 7 x)) 11)
18.0
:> (- 7 11)
error: unknown: -
:> (set - (fun (x y) (+ x (* -1 y))))
(fun (x y) (+ x (* -1 y)))
:> (- 7 11)
-4.0
:> (((fun (y) (fun (x) (* x y))) 7) 9)
63.0

In addition to implementing all of these features, we want our interpreter to handle as many kinds of user error as possible.

:> (+ 'python 'cobra')
error: syntax: too many or too few '
:> (+ 3 ())
error: syntax: empty list ()
:> (+ 3 6
error: syntax: unclosed (
:> (+ 3 6))
error: syntax: too many )
:> (* (2 age)) 4)      
error: syntax: too many )
:> (* 'bad' 'example')
error: *: usage (* num num ... num numOrStr)
:> ((fun (x y) (* x x y)) 3)
error: function: incorrect number of arguments to (fun (x y) (* x x y))

Scanning

The language distinguishes five kinds of symbols:

the left parenthesis (
the right parenthesis )
the single quotation mark '
the four white space characters \t (tab), \r (carriage return), \n (new line), and (space)
everything else; let's call these ordinary characters

There are four kinds of token in our language:

the left parenthesis (
the right parenthesis )
string literals, which are delimited by ' and which contain no '
atoms, which are strings of ordinary characters. Atoms are always scanned greedily, so that they are as large as possible. In other words, it is never the case that two atoms abut each other in the input string, with no intervening white space, parentheses, or '; if they did, then they would be combined into a single atom. Examples of atoms include judy, 3.14159, +, and 2frogj/%lobster.

Your first job is to write the scanner tokenList, based on the stub already present in interpreter.py. The function takes as input a string, representing a single line of user input. It returns a list of strings, namely all of the tokens in the user input, in order. For example, the string (* 11 val) 'so super' scans into the list of six strings ["(", "*", "11", "val", ")", "'so super'"]. If an error occurs in scanning, then an empty list is returned. Implement tokenList using a regular expression. (By the way, my solution totals five lines of code, including a check for mismatched '.)

Parsing

Once the scanning rules are out of the way, the rest of the language's syntax is extremely simple. Here's a context free grammar:

LIST -> EXPR | EXPR LIST
EXPR -> ATOM | STRING | (LIST)

The starting symbol is LIST. Intuitively, a list is a sequence of one or more expressions. An expression is either an atom, a string literal, or a list enclosed in parentheses.

The children of the root of the parse tree correspond to the expressions that the user types on the input line. Otherwise, the structure of the parse tree represents the nesting of the parentheses in each expression. For example, the input line (+ x 14) is parsed into this five-node tree:

        ROOTNODE
        |
BRANCHNODE
|   |   |
+   x   14

The input line ((fun (x) (* x x)) 3) (set y 4) is parsed into this 15-node tree:

             ROOTNODE
             |      |
    BRANCHNODE      BRANCHNODE
    |        |       |   |   |
BRANCHNODE   3      set  y   4
 |   |   |
fun  BR  BRANCHNODE
     |   |    |   |
     x   *    x   x

The interpreter.py file already contains a TreeNode class, which implements a simple tree data structure. Each node can have arbitrarily many children, and can also store an arbitrary Python object as its data "payload". All of the leaf nodes in the parse tree have strings as their data; these strings are exactly the atoms from the token list. The branch nodes' data objects are ignored by the rest of the interpreter, so you may store whatever you like in them.

Your second job is to write the parser parseTree, based on the stub in interpreter.py. This parsing function takes as input a list of strings — the token list produced by tokenList. If the token list does not conform to the grammar, then the parser prints an informative syntax error message and returns None. If the token list conforms to the grammar, then the parser outputs a TreeNode representing the root of the parse tree.

The parser does not need to know or check any details of the semantics of the language, such as how many arguments a given function or keyword takes. That check is performed later, by the evaluator. My solution to parseTree is 31 lines of code, most of which exist to handle syntax errors. My solution uses iteration on a stack. The stack stores all of the nodes along the path from the root node to the node where I'm currently working. If you prefer, you can use recursion, which is equivalent to iteration on a stack by our Recursion homework.

Evaluation

I've already written the evaluator, so you don't need to know anything about the semantics of the language. However, knowing what the language is supposed to do may help you debug.

When an atom is evaluated, we first ask the underlying Python interpreter whether that atom can be interpreted as a floating-point number. If so, then that number is the atom's value. If not, then the atom is inferred to be a variable. When a variable is evaluated, its value is looked up in the environment (the set of variables and their assignments).

When a compound expression (LIST) is evaluated, we require the head of the list to evaluate to a keyword or function. We require the remaining expressions in the list to evaluate to the correct number of values to be passed to that keyword or function. The language comes with built-in + and * functions, and there are two keywords:

The set keyword takes two arguments. The first argument must be an atom, which is not evaluated. The second argument is evaluated. The environment is updated, so that the value of the second argument is associated to the variable indicated by the first argument. Also, the value is returned.
The fun keyword takes two arguments. The first argument is a list of atoms. The second argument is an arbitrary expression, typically depending on the atoms in the first argument. That is, the two arguments define a parameter list and a function body. The fun-expression returns a function, which can be used as the head of a compound expression.

Warning: This weekend I plan to add an if keyword, along with functions ==, !=, >=, <=, >, <. This change should not affect your scanner or parser. Update: I decided not to.

In the following code snippet, is the result of the final statement 7.0 or 9.0? The answer depends on exactly how values are stored in and retrieved from the environment. This question gets at the subtle issue of scope, which we'll leave to a programming language course such as CS 251.

:> (set x 3)
3.0
:> (set f (fun (y) (+ x y)))
(fun (y) (+ x y))
:> (f 4)
7.0
:> (set x 5)
5.0
:> (f 4)

Test and Submit

Once you complete tokenList and parseTree, test the interpreter thoroughly. In particular, it should work on all of the example code given above. (Let me know if you find errors or bad behavior. I already know that there should be several more error checks in the evaluator. That is not your problem.) Submit your work electronically, by mounting the COURSES file server and dropping your edited interpreter.py file in your CS 254 handin folder.