Prototyping Interpreters using Python Lex-Yacc


Dr. Dobb's Journal, March 2004

A little language with powerful features

Shannon is a developer in the IPv6 Group of NTT MCL Inc. He can be contacted at [email protected].


Interpreters for languages such as Python, Perl, PHP, and Ruby are usually written in C for performance reasons. However, prototyping interpreters in Python using the Python Lex-Yacc (PLY) toolset makes it possible to create fully working interpreters in surprisingly few lines of code. To test the Python and PLY environment, I wrote a language called "Squeamish" that consists of only 850 lines of code (including regression tests), yet is powerful enough to implement an IPv6-compatible time server. As a bonus, Squeamish is a (mostly) lazily evaluated functional programming language. (The complete source code for Squeamish is available electronically; see "Resource Center," page 5.)

Python offers several benefits that make this possible. Python has a dynamic type system that supports runtime type detection, and using Python's type system for your interpreter instead of inventing your own is a huge win. Python's lists make it easy to implement parse trees; since all the list operators in Python are accessible using simple syntax, the code that works with the parse trees often looks "cleaner" than equivalent code in languages such as C or Java. Python (and similarly Jython) also makes it easy to provide access to the Python and Java libraries from your new interpreter without requiring special bindings.
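For instance, here is a minimal illustration of how naturally a parse tree falls out as nested Python lists. The Symbol wrapper mirrors the one used in Listing Two; the rest is hypothetical, not Squeamish's actual code:

class Symbol:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return 'Symbol(%r)' % self.name

# The expression (print (+ 1 2.5)) might parse to:
tree = [Symbol('print'), [Symbol('+'), 1, 2.5]]

# Python's list syntax keeps tree-walking code terse:
function, args = tree[0], tree[1:]
assert function.name == 'print'
assert args[0][0].name == '+'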

In this article, I focus on prototyping interpreters using Python and PLY. However, this toolset could easily be used to write compilers as well. Simple syntax features such as string interpolation and Python's triple quotes for multiline strings are convenient when creating compilers. Furthermore, compilers do not have the performance constraints that interpreters do, so it is reasonable to write a production compiler (not just a prototype) in Python. Nonetheless, I was still surprised at how easy Python made the job, thanks to tricks such as reusing Python's type system and libraries.

Python Lex-Yacc

The front end of the interpreter is implemented using PLY. Not surprisingly, PLY is a rewrite in Python of the familiar Lex and Yacc. Lex builds lexers, which are responsible for taking a stream of characters and outputting a stream of tokens such as ints, strings, keywords, and symbols. Yacc, on the other hand, builds parsers, which are responsible for taking a stream of tokens and recognizing sets of tokens as language constructs such as variable declarations, function definitions, and if statements.

PLY was originally developed by David Beazley for an introductory class on compilers at the University of Chicago. Beazley required his students to use PLY to develop a simple Pascal-like language, complete with lexical analysis, parsing, type checking, type inference, nested scoping, and generation of assembly code for the SPARC processor. As a result, PLY is a mature project with extensive error-checking facilities to catch common mistakes. Unlike Yacc in C, which uses LALR parsing, PLY uses SLR parsing, which is reasonably efficient, but slightly restricts the types of grammars that can be successfully parsed. Furthermore, unlike Lex and Yacc in C, PLY relies on Python reflection, so there are no separate Lex and Yacc input files; everything is in Python. PLY is free software and may be distributed under the terms of the Lesser GPL (http://systems.cs.uchicago.edu/ply/).

Python Lex

The first step in implementing an interpreter is to write the tokenizer using Python Lex; see Listing One. Tokens are recognized using regular expressions, and the steps are straightforward:

1. The names of all of the valid token types are declared as a list of strings named tokens.

2. Tokens that require no special processing are declared using module-level variables named t_TOKEN_TYPE where TOKEN_TYPE matches some string in the "tokens" list. Each such variable contains a regular expression string that matches the respective token (Python raw strings are usually used since they are the most convenient way to write regular expression strings).

3. Tokens that do require special processing are declared using module-level functions that follow the same naming convention. Each such function has a doc string containing a regular expression that matches the respective token. The order of these functions is important since each regular expression is tried in order; hence, functions with more complex regular expressions (floats, for instance) should come before functions with less complex regular expressions (ints, for example). Each function receives a value of type LexToken, modifies it, and usually returns it. Unlike Lex in C, which uses unions, the returned value has an attribute named "value" that can be set to a value of any type (even an instance). This flexibility is convenient. If None is returned or if no return is encountered, the lexer ignores the current token and moves on to the next token. This can be used to implement whitespace handling, but a module-level variable named t_ignore is more efficient for that task.

Having defined the tokenizer rules, lex.lex() is called to build the tokenizer. A main (for instance, if __name__ == '__main__':) is included that tests the functionality of the tokenizer using regression tests; a sketch appears below. This is good programming practice in Python, but it is almost essential when writing a language. (The complete source code for main is available electronically; see "Resource Center," page 5.)
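A minimal sketch of such a main, assuming PLY's module-level lex.input and lex.token API (lex.token() returns None at the end of the input):

if __name__ == '__main__':
    # Feed the lexer a string and check each token's type and value.
    lex.input('(print "hi" 5 2.5)')
    expected = [('LPAREN', '('), ('SYMBOL', 'print'), ('STRING', 'hi'),
                ('INT', 5), ('FLOAT', 2.5), ('RPAREN', ')')]
    for expected_type, expected_value in expected:
        token = lex.token()
        assert token.type == expected_type, token.type
        assert token.value == expected_value, token.value
    assert lex.token() is None
    print('tokenizer: all tests passed')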

Python Yacc

The next step in implementing an interpreter is to write the parser using Python Yacc; see Listing Two. After importing the list of tokens from the tokenizer module written earlier, a series of functions is defined. Example 1 shows one such function: It recognizes an expression in parentheses and states that the value of the entire expression is equal to the value of the expression inside the parentheses. Each function has a doc string that contains the appropriate context-free grammar specification (an idea borrowed from John Aycock's SPARK toolkit; http://pages.cpsc.ucalgary.ca/~aycock/spark/). Each function takes a single argument, t, which is a sequence of values (starting at index 1) matching the respective symbols in the doc string. These values are the same values that were stored in the t.value attributes in the tokenizer. The body of the function does whatever processing is appropriate and places the result in t[0].
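Example 1 is not reproduced here, but a grammar rule of the sort just described would look something like the following sketch (the nonterminal name "expression" is illustrative; Squeamish's actual rules appear in Listing Two):

def p_expression_parens(t):
    'expression : LPAREN expression RPAREN'
    # The value of the whole parenthesized expression is simply the
    # value of the expression between the parentheses.
    t[0] = t[2]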

For a simple program (such as a calculator that parses an expression and returns its value), the body of the function might actually do real work. However, more advanced compilers and interpreters usually use these return values to build a parse tree. Just as with Python Lex, this returned value can be of any type, including an instance. Hence, constructing parse trees as nested lists of ints, floats, instances, and so on, is straightforward. For instance, in Listing Two, while the function p_node_int simply returns an int, the function p_nodes_node returns a list. Remember that in p_nodes_node, t[1] and t[2] might be lists themselves, so it pays to keep a recursive mindset. Having defined the parser rules as a set of functions, yacc.yacc() is called, which causes Python Yacc to attempt to construct the LR parsing tables for the grammar specified in the given parser rules. Once again, a main is used (not shown) to test the functionality of the parser using regression tests.
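A minimal sketch of such a parser regression test follows. It assumes, hypothetically, that Symbol defines __eq__ so that symbols compare equal by name; because parse trees are just nested lists, the expected result can be written out by hand:

if __name__ == '__main__':
    # Parse trees are nested Python lists, so a hand-constructed tree
    # serves as the expected value in a regression test.
    tree = yacc.parse('(+ 1 (* 2 3))')
    assert tree == [Symbol('+'), 1, [Symbol('*'), 2, 3]]
    print('parser: all tests passed')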

Implementing the Interpreter Back End

The interpreter back end is built around two entities—the evaluate function in the interpreter module and the symbol table in the Environ class. Understanding these is the key to understanding the entire back end.

Because Squeamish is a Lisp-like language, the evaluate function is based on the idea that every statement is a function call of the form:

(function arg1 arg2 ...)

Even constructs such as "=" and if statements are treated in this manner. This makes the parser amazingly simple because most of the work is pushed into the back end; normally, a parser must be aware of special constructs such as "=" and if statements. Nonetheless, there are distinctions among the different types of functions. After all, user-defined functions must be handled differently than built-in functions. (Imagine trying to implement a user-defined "+" function without the aid of a built-in "+" function. It's possible, but definitely not pleasant!) I call the lowest level functions "language constructs"; these include "=", def (function definition), and del (the opposite of "="), and they are handled directly by the interpreter module itself. Constructs for working with Python modules, such as pyimport and "." (getattr), are handled in the python_constructs module because they are not a core part of the Squeamish language.
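A highly simplified sketch of the dispatch at the heart of such an evaluate function follows. The names (LANGUAGE_CONSTRUCTS, construct_assign) and the calling conventions are illustrative assumptions, not Squeamish's actual internals; Symbol is the class sketched earlier:

def construct_assign(args, environ):
    # (= name value): bind a symbol in the current Environ. Language
    # constructs receive their arguments unevaluated.
    environ[args[0].name] = evaluate(args[1], environ)

LANGUAGE_CONSTRUCTS = {'=': construct_assign}

def evaluate(expression, environ):
    if isinstance(expression, Symbol):        # a variable reference
        return environ[expression.name]
    if not isinstance(expression, list):      # an int, float, or string
        return expression
    head, args = expression[0], expression[1:]
    if isinstance(head, Symbol) and head.name in LANGUAGE_CONSTRUCTS:
        # Language constructs are handled by the interpreter itself.
        return LANGUAGE_CONSTRUCTS[head.name](args, environ)
    # Everything else resolves to a callable: a user-defined, built-in,
    # or external Python function, each with its own calling convention.
    function = evaluate(head, environ)
    return function(environ, *args)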

Next, just as the stack must be manipulated when a C function call is made, work is necessary when calling user-defined functions. This work is done by the function module. Support for calling built-in functions is located in the builtin_function module, whereas the actual list of built-in functions such as "+" and if are implemented in the builtins module. The external_function module handles directly called Python functions since Python functions have no concept of lazy evaluation.

In contrast to the evaluate function, the Environ class is simpler—but just as crucial. It is a subclass of Python's UserDict class (this alone shows how convenient an environment Python provides for implementing symbol tables in contrast to C). It is directly responsible for implementing such features as dynamic scoping and lazy evaluation.

Implementing Dynamic Scoping

Dynamic scoping is a language feature that permits access to a calling function's namespace. In Perl, for instance, dynamic scoping is possible using the local keyword. Dynamic scoping is not as friendly as lexical scoping, but it is easy to implement. Furthermore, it makes parameter passing trivial.

Dynamic scoping is implemented by the Environ class. Each Environ instance has a parent attribute pointing to its parent Environ. When lookups are done, if a symbol is not found in the current Environ, the parent is searched recursively. At the top of the hierarchy is the list of built-ins, so they are always available. Within the Squeamish code, it is necessary to pass around the current Environ instance just as it is necessary to pass around the locals and globals dicts in Python. Since this single instance of Environ contains the locals, globals, and everything in between, it is only necessary to pass around this one instance. When a user-defined function call is made, a new Environ instance is pushed onto the "Environ stack" just like stack frames work in C. In this new Environ, bindings are made for all of the parameters. When the call is finished, the Environ stack is popped.
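Listing Three contains the real Environ class; a minimal sketch of just the parent-chain lookup, leaving out the lazy evaluation machinery described in the next section, might look like this:

from UserDict import UserDict

class Environ(UserDict):
    # A dynamically scoped symbol table (sketch only).
    def __init__(self, parent=None):
        UserDict.__init__(self)
        self.parent = parent          # the enclosing (calling) Environ
    def __getitem__(self, key):
        if key in self.data:
            return self.data[key]     # found in the current frame
        if self.parent is not None:
            return self.parent[key]   # recurse toward the built-ins
        raise NameError("symbol '%s' is not defined" % key)

builtins = Environ()
builtins['+'] = lambda a, b: a + b
frame = Environ(parent=builtins)      # pushed for a function call
frame['x'] = 5
assert frame['x'] == 5                # found locally
assert frame['+'](2, 3) == 5          # falls back to the built-ins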

At this point, it is interesting to consider the implementation of the "=" construct, which sets a symbol binding in the current Environ. However, when a function call is finished, this symbol binding is thrown away along with the Environ it was a member of. There is no way to alter a caller's namespace directly (although calling methods on instances within the caller's namespace may have side effects). This is by design and part of the nature of a functional programming language.

Implementing Lazy Evaluation

Squeamish supports lazy evaluation of function arguments. This means that an argument to a function is not evaluated until it is actually used by the function. This is even true recursively. (Lazy evaluation is not used when directly calling Python functions, since Python has no concept of lazy evaluation. Normal parameter passing semantics are used instead.)

Like dynamic scoping, lazy evaluation is primarily implemented by the Environ class. When a user-defined function is about to be called, an Environ instance is pushed onto the Environ stack, and each argument to the function is encapsulated in an Expression instance as it is placed into the new Environ. An Expression instance is not evaluated until the corresponding parameter is actually used; at that point, the Environ responds to the reference by evaluating the Expression instance. The value of the Expression is then cached, which matches the expected behavior of an expression passed as a function parameter. Listing Three (also available electronically) contains the Environ class.
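A minimal sketch of the Expression wrapper follows (the names are illustrative, and evaluate is the interpreter's evaluate function sketched earlier). The parse tree is saved along with the caller's Environ and evaluated, at most once, when the parameter is first used:

class Expression:
    def __init__(self, parse_tree, environ):
        self.parse_tree = parse_tree
        self.environ = environ   # the caller's Environ (dynamic scoping)
        self.cached = 0          # false until the first use
        self.value = None
    def get_value(self):
        if not self.cached:
            # First use: evaluate the parse tree, then cache the result
            # so later uses do not re-evaluate it.
            self.value = evaluate(self.parse_tree, self.environ)
            self.cached = 1
        return self.value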

Implementing Functions Using Python Closures

Again, the evaluate function is based on the idea that every statement is a function call. There are several types of function calls, however, each requiring its own special calling convention. For instance, user-defined function calls must undergo the Environ stack manipulation. Built-in functions must have each argument wrapped in a Parameter instance so that built-in functions can take advantage of lazy evaluation from code written in Python. External Python functions must have each argument evaluated before the function call is made, because native Python code does not support lazy evaluation. Such a variety of setup procedures could make the evaluate function an ugly, complicated mess.

Object-oriented design suggests that each of these special needs should be separated into its own class, and that each such class should match a common interface. Python provides an appropriate interface: being callable. This is convenient in that a function in Squeamish gets wrapped in something that is callable directly from Python. Perhaps the most obvious approach is to have each class implement the __call__ method. Instances of such a class behave as functions; however, it is not possible to use such instances as instance methods. I had hoped to make it possible to subclass Python classes in Squeamish (just as Jython can subclass Java classes), but this approach didn't pan out.

Alex Martelli ([email protected]) suggested that I take advantage of Python 2.2's new support for closures. In effect, creating a closure permitted me to return something callable, maintain any state that was needed, and wrap Squeamish functions in something that could be used as an instance method. This fit my needs perfectly. Listing Four (available electronically) contains the function module that is used to create closures for user-defined functions. Aside from showing the use of Python closures, Listing Four also shows the Environ stack manipulations mentioned in previous sections.
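A minimal sketch in the spirit of Listing Four, reusing the Environ and Expression sketches from earlier (the names are illustrative). The closure captures the parameter names and body; each call pushes a new Environ, binds the arguments lazily, and evaluates the body in the new frame:

def make_function(parameter_names, body):
    def call(environ, *args):
        # environ is the caller's Environ, passed as the first argument;
        # pushing a child frame onto it gives dynamic scoping.
        frame = Environ(parent=environ)
        for name, arg in zip(parameter_names, args):
            # Bind each argument lazily; the real Environ evaluates an
            # Expression (and caches its value) on first lookup.
            frame[name] = Expression(arg, environ)
        result = evaluate(body, frame)
        # The frame is "popped" simply by being discarded on return.
        return result
    return call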

Conclusion

There are substantial benefits to prototyping interpreters using Python and PLY compared to implementing even a minimal prototype interpreter in C. Directly using Python's type system, using Python lists to implement parse trees, and providing direct access to Python and Java (via Jython) libraries all help in quickly creating a useful interpreter. Neither Python nor PLY requires a separate compilation step as C, Lex, and Yacc do, so the edit/compile/run cycle is streamlined to just edit/run. Choosing to implement lazy evaluation also had novel benefits. For instance, Squeamish does not require special syntax for the if construct; it is provided as an ordinary built-in function.

Although mostly positive, my experiences with Python and PLY were not perfect. The lexer can only read input from a string rather than a stream (this is mostly a limitation of Python's regular expression library). Thus, to implement an interactive shell, the Shell class has to count the parentheses in incoming lines in order to know when a statement is finished before passing anything to the interpreter, as sketched below. Furthermore, constructing the parser tables can require quite a lot of memory. In fact, while using Jython, I found it necessary to construct the parser tables using Python first; otherwise, I would run out of memory on my 256-MB laptop.
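A minimal sketch of the parenthesis counting (the real Shell class differs, and a robust version would also have to skip parentheses inside string literals and comments):

def read_statement(readline):
    # Buffer lines until the parentheses balance, then hand the whole
    # statement to the interpreter as one string.
    lines, depth = [], 0
    while 1:
        line = readline()
        lines.append(line)
        depth = depth + line.count('(') - line.count(')')
        if depth <= 0:
            return ''.join(lines)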

To be fair, when running Squeamish under Jython, Squeamish is acting as a fifth-generation language (assembly -> C -> Java -> Jython -> Squeamish)! My choice to implement lazy evaluation had drawbacks as well. I was not able to subclass Python classes in Squeamish since the closures that wrap Squeamish functions expect the Environ stack to be passed as the first parameter when called from Python. Such a requirement was untenable since Python instance methods require self to be passed as the first parameter.

I have learned that there is a tradeoff between the ease of reusing Python functionality and the flexibility of implementing such functionality by hand. Finally, I would like to remind you of the importance of iterative design and regression testing in a project such as this. Brian Kernighan (coauthor of The C Programming Language) and Rob Pike underscored the importance of iterative design while implementing a simple programming language named HOC in The UNIX Programming Environment (ISBN 0-13-937681-X), and any Extreme Programming advocate can preach about the benefits of regression testing. Being able to quickly and concisely construct a parse tree by hand, and being able to keep all of a module's regression tests inside that module (using the if __name__ == '__main__': construct), let Python remove much of the pain associated with regression testing. Listing Five (available electronically) contains the regression tests performed at the bottom of the interpreter module.

DDJ

Listing One

"""This file contains the lexer rules and the list of valid tokens."""
import lex
import sys
import re

# This is the list of token names.
tokens = (
    'INT', 
    'FLOAT', 
    'STRING',
    'SYMBOL',
    'LPAREN', 
    'RPAREN'
)
# These are regular expression rules for simple tokens.
t_LPAREN    = r'\('
t_RPAREN    = r'\)'

# Read in a float.  This rule has to be done before the int rule.
def t_FLOAT(t):
    r'-?\d+\.\d*(e-?\d+)?'
    t.value = float(t.value)
    return t
# Read in an int.
def t_INT(t):
    r'-?\d+'
    t.value = int(t.value)
    return t
# Read in a string, as in C.  The following backslash sequences have their 
# usual special meaning: \", \\, \n, and \t.
def t_STRING(t):
    r'\"([^\\"]|(\\.))*\"'
    escaped = 0
    raw = t.value[1:-1]        # drop the surrounding quotes
    new_str = ""
    for c in raw:
        if escaped:
            if c == "n":
                c = "\n"
            elif c == "t":
                c = "\t"
            new_str += c
            escaped = 0
        else:
            if c == "\\":
                escaped = 1
            else:
                new_str += c
    t.value = new_str
    return t
# Ignore comments.
def t_comment(t):
    r'[#][^\n]*'
    pass
# Track line numbers.
def t_newline(t):
    r'\n+'
    t.lineno += len(t.value)
# Read in a symbol.  This rule must be practically last since there are so few
# rules concerning what constitutes a symbol.
def t_SYMBOL(t):
    r'[^0-9()][^()\ \t\n]*'
    return t
# These are the things that should be ignored.
t_ignore = ' \t'
# Handle errors.
def t_error(t):
    raise SyntaxError("syntax error on line %d near '%s'" % 
        (t.lineno, t.value))
# Build the lexer.
lex.lex()


Listing Two

"""This file contains the parser rules.

The function yacc.parse, which this function makes available, returns a parse 
tree.  The parse tree is a set of nested lists containing ints, floats, 
strings, Symbols, etc.
"""
import yacc
import sys

from lexer import tokens
from Symbol import Symbol

def p_list(t):
    'list : LPAREN nodes RPAREN'
    t[0] = t[2]
def p_nodes_node(t):
    'nodes : node nodes'
    t[0] = [t[1]] + t[2]
def p_nodes_empty(t):
    'nodes : empty'
    t[0] = []
def p_empty(t):
    'empty :'
    pass
def p_node_int(t):
    'node : INT'
    t[0] = t[1]
def p_node_float(t):
    'node : FLOAT'
    t[0] = t[1]
def p_node_string(t):
    'node : STRING'
    t[0] = t[1]
def p_node_symbol(t):
    'node : SYMBOL'
    t[0] = Symbol(name = t[1])
def p_node_list(t):
    'node : list'
    t[0] = t[1]
# Error rule for syntax errors.
def p_error(t):
    raise SyntaxError("invalid syntax")
# Build the parser.
yacc.yacc()
