Saturday, 3 May 2014

Compiler Part 10: Compiling to C

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree
Part 9: Parsing

The final stage at last!

With our language spec being so simple I could have easily skipped C and gone straight to Assembly. I have two reasons for not doing so. First, is portability. I don’t have to write any architecture specific code in C for this tutorial. C has been ported to lots of different systems so I’ll let your C compiler do the work for me.

Second, assembly is going to be a lot more foreign to many programmers than C. Even if you've never written anything in C before it will be easier to understand than Assembly.

Compiling

This should be starting to look familiar. We’re going to be walking the AST and either generating C code or doing a calculation.

The output generated by our compiler will go into a file which shares the same name as the source file with a “.c” extension appended to it. CompileFile creates the output file, parses the source code and starts the output process.

As with the parser, the first element of the AST will be the File object. While the File object itself doesn't contain anything useful outside our root expression, it does provide us with the opportunity to produce some output.

This part actually isn't that exciting. It’s just a bunch of boilerplate. We provide an entry point for the program, an exit status, and a printf statement which requires we also include the C standard input/output header.

Optimization


I said at the beginning of this series that I wouldn’t be talking much about optimization, and I won’t. That said, our language is ideally suited to one type of optimization: pre-calculation.

We could go through the task of emitting code for each element of our tree but to what end? It is surprisingly complex and provides no benefit. Instead, why not do all the calculations at this stage and make the final output simpler and faster?

If the root expression is a number then we pass that number to printf. If it’s a binary expression, we can have some fun.

First stop is a generic function called compNode where we determine what kind of object we have: a BasicLit or a BinaryExpr.

If it is a basic literal, which we know is an integer, we simply convert it to an actual integer and return the result.

The binary expression is just as easy. The first element of the list of expressions is always the start point. Order of the operands is important for some operations like division and subtraction. The first element of the expression list is the starting point and is stored in a temporary variable (as an aside, this would be like using the eax or rax register in assembly).

Then, based on the operator, we compute the result and store it back into the same variable for each additional operand. Once complete, the result is returned. This is done recursively until everything is resolved.

Final Stage


Once the compiler is complete there’s still more work to be done. First off, we need to create a command to read in the Calc 1 source code and call the compiler library. Next, a C compiler (likely gcc or clang) should be run on the output from the compiler command. Finally, depending on how the C compiler was run, you might need to run a linker on the object code output from the C compiler.

This is all handled by a program called calcc, which stands for Calc Compiler. Another example of my brilliant naming skills.

I don't think there's anything spectacular going on in this program. It merely opens the input file, verifying that it has the .calc extension, and calls comp.CompileFile on it. It then uses exec.Command from the Go standard library to execute the C compiler and linker.

It has several flags which can be used to control the C compiler a bit.

Last Words


I sincerely hope I've helped someone learn a little something about compiling. It is a massive subject and it can be pretty overwhelming.

For some, this series is probably a little too simple and for that I’m sorry. I hope you can bare with me when I take things to the next level in Calc 2.

There is a tremendous amount I've skipped over and I hope to continue in the next series to start addressing these shortcomings. Topics I hope to include in Calc 2 are:

  • symbol tables 
  • scope 
  • function declarations 
  • variable declarations 
  • comparisons and branching
  • variable assignment
  • typing 
  • stack memory

I will be basing Calc 2 directly on the code from Calc 1 so anything you've learned here will, I think, be directly applicable in the next series.

If a Calc 3 comes to fruition, and I hope it does, I certainly plan to tackle assembly. If I do, then I might include a few articles solely on assembly itself, possibly as a separate series prior to Calc 3. Other possible ideas are: objects, methods, multiple-assignment, multiple files and libraries.

Thanks for reading!

I bid thee, adieu!