Saturday, 3 May 2014

Compiler Part 7 - Scanning

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens

Now, we can finally work on the scanner.

Lexical Analysis


So, where do we start?

That was the hard part. Scanning, to me, seemed like it should be easy but I quickly got lost in the details. There are many ways to implement a scanner and I’ll show you how to do one type. Rob Pike did a great presentation on another cool method in a talk called: Lexical Scanning in Go.

The basic concept of this scanner is that we start at the top, moving left to right, until we reach the end of the source code. Each time we find an element of interest we’ll report the literal string found, a token telling the parser what it is and the position at which it was found.

Finite State Machine

Now, I’m not going to go into any real detail about things like a Finite State Machine (Finite Automata) or anything. You should investigate this yourself. Coursera has a course on compiler design you may find of help, too, that covers this topic. The concepts are important but its not strictly necessary to know every little detail (though I do encourage you to learn them).

The basic idea here is that we have a finite number of states which our scanner can return. These states are represented by tokens and since we can only return the limited tokens we’ve defined, we can say that our scanner has finite state. Hence, finite state machine. Understanding automata comes in handy when understanding regular expressions and/or when accepting or rejecting an individual character being scanned.

This will all become clear shortly.

Oops


I’d like to make it very clear one of the mistakes I made on my first attempt at writing a scanner. Writing any part of the compiler without some kind of solid definition for your language is a terrible idea. If the core design of your language is still fluid then you’re going to be rewriting your compiler a lot. And I mean, a lot. I had to rewrite my first attempt at an interpreter several times, pretty much from scratch, each time I revised my language. A complete waste of time.

It was this process that finally made me realize how bad decisions are made. What at first seems like a good idea might turn out to be a poor idea later but not making a decision at all will end up being disastrous. Many times I've criticized a language’s design asking myself, “Why in heck would you have done THAT? It’s so stupid!” Hindsight is 20/20 my friends.

The Scanner

The scanner is fairly simple. We start with a simple object which tracks things like the current character scanned, the offset from the beginning of the file, something we’ll call the reading offset, the source code being scanned and a pointer to the details about the file itself.

The first step to scanning is to initialize the scanner via Init. There’s nothing very special here outside the call to the next method, which I call “priming the pump.”

The next method is a funny little function. It first resets the current character to zero to indicate the end of the file. If the reading offset is less than the length of the file then the offset is changed to the reading offset. If a newline is encountered we make note of it’s location in the file object. Note that we discard the newline but still note its location. Finally, we update the current character and increment the reading offset.

Reading Offset and Unicode

What’s the deal with it? It mainly has to do with Unicode. A single character may take one or more bytes so you can’t just increment the offset by one each time. The UTF8 package’s DecodeRune function returns the number of bytes in the next character. The reading offset is used in that situation to mark the start of the next rune to read.

While this scanner won’t be Unicode friendly we can still start incorporating some of the functions we’ll end up needing so that we’ll have less work to do later when we add it. We’ll also be using the IsDigit and IsSpace functions from the unicode package.

Scan


This puppy is the meat and potatoes of the scanner. The Scan method starts by skipping any whitespace. The method skipWhitespace simply advances the scanner one character at a time until it reaches a non-whitespace character. I've utilized the unicode.IsSpace function to achieve this goal.

Next, it looks for multi-character elements. In this case, we are just looking for numbers. After that, we look for single character elements and finally wrap things up by reporting either an illegal character or the end of the file.

After each pass we always need to increment the scanner to the next character via a call to next and return the results of the scan.

We should also have our handy language spec at our sides. It tells us exactly what to do. Head back to Part 5 if you need to find it again.

Integers


If we encounter a digit, we scan for a longer sequence via scanNumber.

I've chosen to use the unicode.IsDigit function but we could have just as easily written our own implementation. Something as simple as: return s.ch >= ‘0’ && s.ch <= ‘9’ would have sufficed. scanNumber works by continually advancing the scanner until a non-digit is found. The if statement after the loop handles the instance that a number occurs at the end of a file.

In a later version of Calc, or on your own, I will expand this function to include various other numerical forms like floating point and hexadecimal.

More Scan


If a digit was not found, we move on and check for single characters. These should all be pretty self-explanatory with maybe the exception of the semi-colon and default.

Comments start at a semi-colon and go to the end of the line or end of the file. In the current incarnation of the scanner comments are discarded but it would be just as simple to keep them. For instance, the Go scanner reports any comments it finds and passes them along to the parser so that the parser can create those wonderful Go docs we all know and love!

Wrap Up


That’s all there is to the scanner. It’s fairly straight forward.

On to parsing!

[Edit: Add link to Rob Pike's talk on Lexical Scanning in Go]