Saturday, 25 October 2014

Compiler 2 Part 6: Scanner

Much like the tokens, the scanner hasn't changed much; it has mainly gained some new functionality. This should be a short but sweet post.


To scan for an identifier, we check whether the current character in the scanner is a letter, as our specification for an identifier requires. If it is, we scan the rest of the identifier and return the result.
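As a rough illustration of that check, here is a minimal sketch of the dispatch. The names kind, classify, isLetter and isDigit are made up for this example and are not taken from the Calc 2 source; accepting any Unicode letter is also an assumption, since the specification may be stricter.

```go
package main

import "unicode"

// kind is a hypothetical label for what the scanner should scan next.
type kind int

const (
	other kind = iota
	identifier
	number
)

// isLetter reports whether ch may start an identifier.
// Assumption: any Unicode letter qualifies.
func isLetter(ch rune) bool { return unicode.IsLetter(ch) }

// isDigit reports whether ch is an ASCII decimal digit.
func isDigit(ch rune) bool { return '0' <= ch && ch <= '9' }

// classify mirrors the decision made at the top of a Scan method:
// a letter starts an identifier, a digit starts a number, and
// everything else is handled as an operator or punctuation.
func classify(ch rune) kind {
	switch {
	case isLetter(ch):
		return identifier
	case isDigit(ch):
		return number
	default:
		return other
	}
}
```

In a real scanner the identifier branch would hand off to scanIdentifier instead of returning a label.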

A small change to the next section allows us to scan two characters at a time for our comparison operators. Consider assignment versus equality: one operator is a single equals sign and the other is a double equals sign. After storing the current character in a temporary variable, the scanner is advanced again so we can look ahead by one character.

Once we find a valid token and determine that it might be one of two different values, we call selectToken to determine which token we want. Since the scanner has already advanced to the next character, we test to see if it matches a second character. If it matches, we return the first token and advance the scanner again. Otherwise, we return the second token.
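To make the lookahead concrete, here is a small self-contained sketch. The Token constants, the Scanner fields and the scanOperator helper are assumptions made for this example, and the scanner reads one byte at a time, so it only handles ASCII input. Only the name selectToken comes from the text above.

```go
package main

// Token is a stand-in for the real token type.
type Token int

const (
	ILLEGAL Token = iota
	ASSIGN        // =
	EQL           // ==
	LSS           // <
	LTE           // <=
)

// Scanner is a simplified, ASCII-only scanner used for illustration.
type Scanner struct {
	src    string
	offset int  // index of the character after ch
	ch     rune // current character; 0 means end of input
}

func NewScanner(src string) *Scanner {
	s := &Scanner{src: src}
	s.next() // load the first character
	return s
}

// next advances the scanner by one character.
func (s *Scanner) next() {
	if s.offset >= len(s.src) {
		s.ch = 0
		return
	}
	s.ch = rune(s.src[s.offset])
	s.offset++
}

// selectToken returns a and consumes the current character if it equals r;
// otherwise it returns b and leaves the scanner where it is.
func (s *Scanner) selectToken(r rune, a, b Token) Token {
	if s.ch == r {
		s.next()
		return a
	}
	return b
}

// scanOperator stores the current character, advances once so that s.ch
// is the lookahead, and then picks between the one- and two-character forms.
func (s *Scanner) scanOperator() Token {
	ch := s.ch
	s.next()
	switch ch {
	case '=':
		return s.selectToken('=', EQL, ASSIGN)
	case '<':
		return s.selectToken('=', LTE, LSS)
	}
	return ILLEGAL
}
```

Note that when the second character does not match, it is left in place as the current character, ready for the next call to the scanner.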


The next function had a bug in Calc 1: a newline was never recorded, so the Position of a character was never reported properly. Since most of the source code passed to the Calc 1 scanner was on a single line, the bug wasn’t caught until after the compiler was released.

The bug has since been fixed and back-ported to Calc 1.
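For readers curious why recording newlines matters, here is a sketch of one way to track them. It is not the actual Calc code: the lines field and the position method are hypothetical, and the scanner is byte-based, so it assumes ASCII input.

```go
package main

// Scanner is a simplified scanner that remembers where each line starts.
type Scanner struct {
	src    string
	offset int   // index of the character after ch
	ch     rune  // current character; 0 means end of input
	lines  []int // offset of the first character of every line seen so far
}

func NewScanner(src string) *Scanner {
	s := &Scanner{src: src, lines: []int{0}} // line 1 starts at offset 0
	s.next()
	return s
}

// next advances by one character and records the start of a new line
// whenever a newline is consumed. Skipping this step is the kind of
// omission that makes every position appear to be on line 1.
func (s *Scanner) next() {
	if s.offset >= len(s.src) {
		s.ch = 0
		return
	}
	s.ch = rune(s.src[s.offset])
	s.offset++
	if s.ch == '\n' {
		s.lines = append(s.lines, s.offset)
	}
}

// position converts a byte offset into a 1-based line and column using
// the recorded line starts. Only lines already scanned are known.
func (s *Scanner) position(off int) (line, col int) {
	for i, start := range s.lines {
		if start > off {
			break
		}
		line, col = i+1, off-start+1
	}
	return line, col
}
```

Without the append in next, position would report everything as line 1, which goes unnoticed as long as the input fits on a single line.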


Like scanNumber, scanIdentifier continues to advance the scanner until the end of an identifier is found. As long as the first character of an identifier is a letter, the remaining characters can be letters or digits.

The next part of the code is a little quirky. It is there to protect against the file ending in the middle of an identifier, which is obviously an error. It is also needed when an identifier sits alone on a line because it is the return value of a function.

Last, it makes a call to token.Lookup. This code, if you recall, determines whether the identifier is actually a keyword or just an identifier. That’s why we had to change the default return value of Lookup to the IDENT token, rather than ILLEGAL.
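Putting those pieces together, here is a simplified sketch of scanning an identifier and looking it up. The keyword table, the Token constants and the Scanner fields are assumptions for illustration and are not the real token package. In particular, the end-of-input adjustment shown here is only this sketch's way of handling an identifier that runs to the end of the source. The input is assumed to be ASCII.

```go
package main

// Token is a stand-in for the real token type.
type Token int

const (
	IDENT Token = iota
	IF
	VAR
)

// keywords is a hypothetical keyword table.
var keywords = map[string]Token{"if": IF, "var": VAR}

// Lookup returns the keyword token for lit, or IDENT if lit is not a keyword.
func Lookup(lit string) Token {
	if tok, ok := keywords[lit]; ok {
		return tok
	}
	return IDENT
}

type Scanner struct {
	src    string
	offset int  // index of the character after ch
	ch     rune // current character; 0 means end of input
}

func NewScanner(src string) *Scanner {
	s := &Scanner{src: src}
	s.next()
	return s
}

func (s *Scanner) next() {
	if s.offset >= len(s.src) {
		s.ch = 0
		return
	}
	s.ch = rune(s.src[s.offset])
	s.offset++
}

func isLetter(ch rune) bool { return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' }
func isDigit(ch rune) bool  { return '0' <= ch && ch <= '9' }

// scanIdentifier must be called with s.ch on the first letter.
func (s *Scanner) scanIdentifier() (string, Token) {
	start := s.offset - 1 // s.ch lives at offset-1
	for isLetter(s.ch) || isDigit(s.ch) {
		s.next()
	}
	end := s.offset - 1 // s.ch is the first character past the identifier
	if s.ch == 0 {
		end = s.offset // at end of input next() did not advance, so include the last character
	}
	lit := s.src[start:end]
	return lit, Lookup(lit)
}
```

Because Lookup falls back to IDENT, the scanner can pass every identifier-shaped literal through it without a separate keyword check.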


Again, not a lot to do here. There’s nothing particularly amazing in the additions.

On to parsing and the abstract syntax tree! This is where the major changes appear!