Saturday, 3 May 2014

Compiler Part 6: Tokens

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification

In this post we can finally start to lay down some code!


Previously, we discussed our grammar and the set of tokens we’ll need to scan for. We defined an expression, a number and operators. We also identified we expect to encounter some parentheses. We should also let our parser know when the scanner has reached the end of the file.

Before we can start scanning we need to formalize our code for the tokens our scanner will be emitting. Our tokens are going to be used in almost all stages of our compiler. If we want to create tools like Go’s fmt or vet programs we might need to reuse these tokens.

Here’s the first section of code:

The constants might look a bit funny at first. There are some lower-case, non-exported tokens intermixed with upper-case, exported ones. The non-exported tokens will help us with writing some utility functions and allow the language to expand without modifying any other code.

Next comes a mapping of Tokens to strings. I could also have used an array of strings but I didn't. It makes writing the Lookup function easy.

The rest are some utility functions. You can see with IsLiteral and IsOperator where our non-exported constants come in handy. No matter how many new operators or literals we add we never have to adjust these functions. Handy!

Lookup, String and Valid help us when generating error messages.


This file might take a moment to wrap your head around. I’ll try and take it slow.

When scanning, we start at the first character in the stream, starting at the top and working down, moving left to right. The first character is at offset zero.

By comparison, when reporting an error a user wants to know the line and column at which the error occurs. The first character would be line one, column one. We therefore have to translate the position of the character offset into something meaningful to the end user.

A position, Pos, is the character offset plus the file’s base. With a base of one, a character at offset zero has a Pos of one.

It is illegal to have a position of zero because it would be outside of the file. Consequently, it would also be illegal to have a position greater than the base of the file plus it’s length.

Why something so complicated you might be wondering? Well, once we need to parse multiple files it can be quite confusing to identify in which file an error occurred without a lot of overhead. The Pos makes it easy. More on this in the next series.

The Position type is strictly for error reporting. It allows us to output clear information about which line and column the error happened at and in which file it occurred. At this stage we’ll only ever be dealing with a single file but we’ll be grateful for this code later.


File is, strictly speaking, completely unnecessary for writing a compiler but I feel that good, clear errors messages are vitally important. Go does a pretty good job but many compilers don’t. The GNU C compiler, for instance, used to be horrid. It has improved a lot over the years.

It provides a framework for something yet to come. This code is really just laying the groundwork for when we have to process more than a single file.

It’s sole purpose is essentially for error reporting and goes hand-in hand with the position code. Again, since we only have one file the base, or start, position is one. It can’t ever be less but in the future it could be higher. Don’t worry about that now, though.

Each time the scanner detects a newline character, we want to add that position to the list of newlines in the file. This allows the Position function to accurately calculate where an error occurred and to report it’s position.


That about covers the token section. As promised, I've not talked much about the support code.

I suggest referring back to this code often. Once you begin to understand how it all works together it will make a lot more sense. This library gets used pretty liberally throughout the compiler so we'll be coming back to it often.