Saturday, 3 May 2014

Compiler Part 8 - Abstract Syntax Tree

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning

Before we can build the parser we need to take a detour and talk about how we’re going to handle the data we’re going to process.

We need to use some kind of abstract data type to hold all the data we’re parsing. As it turns out, a tree data structure is perfect for our needs. This tree describes the syntax of our programming language and is appropriately called the Abstract Syntax Tree (AST).

AST


Tree data structures always start with a root and ours is no different. Typically, in a full-fledged compiler, you’d probably have an object which represents a package or program. In our case, we only have one file so we’ll have an object called File.

The rest of the objects we need to create can be found in our grammar blueprint. I refer you, again, back to Part 5 where we created our grammar rules. This handy layout tells us exactly what we need. Aren’t you glad we created it? It’s also going to come in very handy when we start parsing and syntax analysis.

The other objects we’ll need to create: an expression and an integer.

Go Interfaces


Before we create our objects we need to consider what they’re going to look like. What is it we know?

We know that our scanner is going to provide us with three things:

  • The literal string of the element we've scanned. A number might be “42”.
  • The token representing the element. An opening parenthesis would be token.LPAREN.
  • The position at which the element starts. A number between the file’s base and base + size.

One thing that will be common to all objects in our tree is their position. Where do they start and where do they end? Error reporting, for example, won’t care what kind of object it has. It only cares what it’s position is. So, we create a Node interface which has two methods: Pos, for the start position, and End, for the end position.

Next, there is a recurring element in our grammar rules: expressions. An expression can either be an integer or a sequence of elements encapsulated in parentheses. The interface will be called an Expr. Any Expr objects must implement the method exprNode as well as fulfill the Node interface.

We can now pass our objects around between different parts of our compiler based on their interface and use type assertions to discover their exact type, if required. When we get to parsing, this ought become more clear.

Objects


There are three we have to implement as well as one extra helper object:

  1. Integer
  2. Expression
  3. File

We’ll start with the Integer. We have to be a bit forward thinking with our objects and the integer type is no different. We could spend a lot of time creating an object for each type in our language. We could, however, just create a single object to hold any basic type.

BasicLit is just such an object. Whether we have an integer, float, string, or character it doesn't matter. This object satisfies our needs for all of them. Of course, we only have integers at the moment but eventually we want to add more. This object holds the token telling us what type of object it is, the position where the literal starts and the string representing the literal.

Skipping over the Expression for a moment, we’ll move on to File. This object is even simpler. This is where it all starts. It’s our starting point and has one field, Root. Looking back at our grammar rules we can see that a file holds an expression and that expression is the root of everything else.

Our expressions are of a specific type right now. They’re binary expressions. Mathematics. We have several elements we have to keep track of: An operator, opening and closing parentheses and two or more operands which are, in of themselves, expressions.

A binary expression won’t always be our only type of expression. Eventually, we’ll want more. One thing that will be common to every expression in our language is an opening and closing paren. That’s where the generic Expression object comes in handy. It’s only job is to handle the parens. This object will be nested into all our sub-expressions.

The beauty of Go is that any objects with an embedded  type sort of “inherit” the nested object’s methods. It’s not true inheritance but rather the compiler being clever. Regardless, the result is the same for our purposes.

As already stated, the BinaryExpr must have an operator and a list of operands. We have fields that hold the operand and where it is located is the sources. Our operands are held in the field List which is a slice of objects which fulfill the Expr interface.

AST Utilities


If we need to inspect our AST, a method of walking the tree would be handy. The simple Walk function does just that.

Printing the AST is done via the Print method which merely Walks the tree and prints out some information about each node in order.

Moving On


We’ll finally get to parsing soon but I want to be sure the code for the AST is clear. There’s that nasty recursive definition of an Expression and likely some methods that don’t provide any context to their use. Make sure you review this article and any previous posts before moving on.

Thankfully, parsing ought to make things more clear how all the pieces fit together.