Saturday, 3 May 2014

Compiler Part 9: Parsing

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree

We've come a long way so far. We've covered the basic principles of scanning and the abstract data tree. Now, we can finally move on to parsing.

If you’re struggling to keep up at the point then I’ll warn you that it just gets harder from here on out. Parsing is probably the hardest part to wrap your head around conceptually. We will be taking the lexemes our scanner detects, give them meaning and storing the resulting objects into the AST.

Make sure you understand the previous material before moving on.

The Parser Object


The parser for our little language is pretty simple. Like our scanner, we have a pointer to the information on the file we’re scanning. We have a way to track errors and a scanner object.

The last three fields directly reflect the same three values returned by the scanner: the position, the token and the literal string.

Entry Point


ParseFile is where the magic begins. We initialize the parser and begin scanning. If one or more errors are encountered we stop and print out the errors. Otherwise, we return the file object, which is the entry point into the program.

Just like with the scanner, init primes the pump to get things started. Any calls to any other parser functions will advance the parser and scanner and we can’t go back. Fuel is in the line so lets start the engine!

parseFile is a companion to it’s exported cousin. This function creates the first object for our AST. Going back to Part 5 yet again (see how important grammar rules are?) we can see that a file object is the root of everything. That’s exactly what we've done. The very first object is the File object.

Note that we don’t really check anything before proceeding to find the first expression. That’s because our grammar rules tell us this has to be true. Anything else is wrong. We expect to find an expression of some kind and we’ll be mighty upset if we don’t get one so we’ll just plow on ahead!

Last, we ensure that the very last thing in our file is the End of File (EOF) token. Our grammar rules show that after we find our root expression there should be nothing else. Report an error if there if something else is found and parade it around for everyone to look at and ridicule!

Expressions


We've discussed what an expression is a few times. Now, our job is to find one with parseGenExpr but we don’t know what kind to expect at first. Well, the first token we find is going to tell us all we need to know. If we find an opening bracket then we have a binary expression. If we find an integer, that’s what we have. Otherwise, we generate an error and move on.

The integer is the easiest element to parse. There’s no point into going into detail on it. The code is self-explanatory.

However, the binary expression is a bit trickier. Calc 1 only has binary expressions but later on we’re going to have more types of expressions. Just like with the Expression object we should have a generic way of handling expressions in general before dealing with the specifics. parseExpr handles this.

First off, we expect to find a left paren. It’s bad if we don’t. Next off, we need to figure out what kind of expression we have. We know that the only expression we have is a binary expression so we’ll test to find out what operator is the next token. If we get something else, we need to report an error.

The BinaryExpr exemplifies the recursive nature of our parser. We've already found the opening paren and operator so now we need to find our operands. This is done via a loop which recursively calls parseGenExpr. We keep building up our tree deeper and deeper until we reach the end.

After we find all the operands we expect a right paren to close the expression. Finally, we return the resulting BinaryExpr object and it gets inserted into our tree.

Expect, Next and AddError


Expect is a great little utility function. We tell it what token we expect to find. If it’s there, great. If not, we report an error. Regardless, we return the position of that element.

Next doesn't really need explanation. It just takes exactly what the scanner finds and stores it in the parser object.

AddError appends a new error message to our ErrorList. If more than ten errors have been found there’s really no point in continuing. It prints the errors and forces the parser to exit with an error code.

Syntax Analysis


If it isn't clear, our parsing stage does syntax analysis as it works. It ensures that the source code we’re parsing adheres to our grammar rules.

Anything less is unacceptable. No, seriously.

Finished


That’s it for this stage. If you got through this then the final bits should be easy. The parser will get more and more complicated as our grammar gets more in-depth. Calc 2 ought to serve as a good example of that.