2013 April 16,

CS 254: Day 08

Carleton College, Spring 2013

Prof. Joshua R. Davis, , CMC 325, x4473

Introduction

In this assignment, we're going to implement a simple programming language together. You may have done an exercise like this in a previous course, such as CS 251. In our course, the purpose of the exercise is to demonstrate that scanning is a problem for finite automata, parsing is a problem for pushdown automata, and evaluation is a problem for Turing machines.

Our language has only two data types: floating-point numbers and functions. The user can add and multiply numbers. The user can construct her own functions using the fun keyword. The user can assign values to variables using the set keyword. And that's about it. So the language is quite limited and simple. It does have one sophisticated feature: Functions are first class in this language — they can be assigned to variables, they can be passed to other functions as input, and they can be returned by other functions as output.

Here's a transcript. The user input lines begin with the prompt :>. For each user input, the interpreter prints its response, or an error message. The user may enter more than one statement on a single input line. If you know Lisp, then this language may seem familiar, but it is definitely not Lisp.

Desktop > python interpreter.py 
Welcome to So Super Language. Press Control-D to exit.
:> 7
7.0
:> (+ 2 -4.1)
-2.1
:> (set age 20)
20.0
:> (* age 2)
40.0
:> (set mult3 (fun (x) (* x 3)))
(fun (x) (* x 3))
:> (mult3 age)
60.0
:> ((fun (x) (+ 7 x)) 11)
18.0
:> (- 7 11)
error: unknown token: -
:> (set - (fun (x y) (+ x (* -1 y))))
(fun (x y) (+ x (* -1 y)))
:> (- 7 11)
-4.0
:> (((fun (y) (fun (x) (* x y))) 7) 9)
63.0
:> (* 15 40) (- 100 7) age
600.0
93.0
20.0
:> (+ 3 ())
error: syntax: empty list ()
:> (+ 3 6
error: syntax: unclosed (
:> (+ 3 6))
error: syntax: too many )
:> (* (2 age)) 4)      
error: syntax: too many )
:> ^D
Desktop >

Remember from class that interpreting a statement in a programming language consists of three basic steps:

  1. The statement is given to the interpreter as a string. The interpreter scans the string into a list of tokens. Intuitively, each token is a word or punctuation mark, as defined by the language.
  2. The list of strings is parsed into a tree that reflects its grammatical structure, as defined by the syntax of the language. In particular, subexpressions correspond to subtrees.
  3. The tree is evaluated. Intuitively, the work described by the tree is carried out. A value is computed and returned to the user or stored in a variable. The set of variables and their values is called the environment. When the tree being evaluated contains variables, their values are looked up in the environment.

I've already written the evaluator and the user interface; see interpreter.py. You're going to write the scanner and parser. If you write your code to specification, then it will work seamlessly with my code.

Scanning

The language distinguishes four kinds of symbols:

There are three kinds of tokens in our language: left parentheses, right parentheses, and atoms. An atom is a string of ordinary characters. Atoms are always scanned greedily, so that they are as large as possible. In other words, it is never the case that two atoms abut each other in the input string, with no intervening white space or parentheses; if they did, then they would be combined into a single atom. Examples of atoms include judy, 3.14159, +, and 2frogj/%lobster.

Your first job is to write the scanner tokenList, based on the stub already present in interpreter.py. The function takes as input a single line of user input, as a string. It outputs a list of strings, namely all of the tokens in the user input, in order. For example, the string (* 15 40) (- 100 7) age scans into the list of strings ["(", "*", "15", "40", ")", "(", "-", "100", "7", ")", "age"]. Implement tokenList using a regular expression. My solution totals two lines of code.

Parsing

Once the scanning rules are out of the way, the rest of the language's syntax is extremely simple. Here's a context free grammar:

The starting symbol is LIST. Intuitively, a list is a sequence of one or more expressions. An expression is either an atom (in the sense of the previous section) or a list enclosed in parentheses.

The children of the root of the parse tree correspond to the expressions that the user types on the input line. Otherwise, the structure of the parse tree represents the nesting of the parentheses in each expression. For example, the input line (+ x 14) is parsed into this five-node tree:

        ROOTNODE
        |
BRANCHNODE
|   |   |
+   x   14

The input line ((fun (x) (* x x)) 3) (set y 4) is parsed into this 15-node tree:

             ROOTNODE
             |      |
    BRANCHNODE      BRANCHNODE
    |        |       |   |   |
BRANCHNODE   3      set  y   4
 |   |   |
fun  BR  BRANCHNODE
     |    |   |   |
     x    *   x   x

The interpreter.py file already contains a TreeNode class, which implements a simple tree data structure. Each node can have arbitrarily many children, and can also store an arbitrary Python object as its data "payload". All of the leaf nodes in the parse tree have strings as their data; these strings are exactly the atoms from the token list. The branch nodes' data objects are ignored by the rest of the interpreter, so you may store whatever you like in them.

Your second job is to write the parser parseTree, based on the stub in interpreter.py. This parsing function takes as input a list of strings — the token list produced by tokenList. If the token list does not conform to the grammar, then the parser prints an informative syntax error message and returns None. If the token list conforms to the grammar, then the parser outputs a TreeNode representing the root of the parse tree.

The parser does not need to know or check any details of the semantics of the language, such as how many arguments a given function or keyword takes. That check is performed later, by the evaluator. My solution to parseTree is 31 lines of code, most of which handle syntax errors. My solution uses iteration on a stack. If you prefer, you can use recursion, which is equivalent to iteration on a stack by our Day 06 homework.

Evaluation

I've already written the evaluator, so you don't need to know anything about the semantics of the language. However, knowing what the language is supposed to do may help you debug.

When an atom is evaluated, we first ask the underlying Python interpreter whether that atom can be interpreted as a floating-point number. If so, then that number is the atom's value. If not, then the atom is inferred to be a variable. When a variable is evaluated, its value is looked up in the environment (the set of variables and their assignments).

When a compound expression (LIST) is evaluated, we require the head of the list to evaluate to a keyword or function. We require the remaining expressions in the list to evaluate to the correct number of values to be passed to that keyword or function. The language comes with built-in + and * functions, and there are two keywords:

In the following code snippet, is the result of the final statement 7.0 or 9.0? The answer depends on exactly how values are stored in and retrieved from the environment. This is the subtle issue of scope. Let's not get into it.

:> (set x 3)
3.0
:> (set f (fun (y) (+ x y)))
(fun (y) (+ x y))
:> (f 4)
7.0
:> (set x 5)
5.0
:> (f 4)

Test and Submit

Once you complete tokenList and parseTree, test the interpreter thoroughly. In particular, it should work on all of the example code given above. (Let me know if you find errors or bad behavior. I already know that there should be several more error checks in the evaluator. That is not your problem.) Submit your work electronically, by mounting the COURSES file server and dropping your edited interpreter.py file in your CS 254 handin folder.