Saturday, 3 May 2014

Compiler Part 10: Compiling to C

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree
Part 9: Parsing

The final stage at last!

With our language spec being so simple I could have easily skipped C and gone straight to Assembly. I have two reasons for not doing so. First, is portability. I don’t have to write any architecture specific code in C for this tutorial. C has been ported to lots of different systems so I’ll let your C compiler do the work for me.

Second, assembly is going to be a lot more foreign to many programmers than C. Even if you've never written anything in C before it will be easier to understand than Assembly.

Compiling

This should be starting to look familiar. We’re going to be walking the AST and either generating C code or doing a calculation.

The output generated by our compiler will go into a file which shares the same name as the source file with a “.c” extension appended to it. CompileFile creates the output file, parses the source code and starts the output process.

As with the parser, the first element of the AST will be the File object. While the File object itself doesn't contain anything useful outside our root expression, it does provide us with the opportunity to produce some output.

This part actually isn't that exciting. It’s just a bunch of boilerplate. We provide an entry point for the program, an exit status, and a printf statement which requires we also include the C standard input/output header.

Optimization


I said at the beginning of this series that I wouldn’t be talking much about optimization, and I won’t. That said, our language is ideally suited to one type of optimization: pre-calculation.

We could go through the task of emitting code for each element of our tree but to what end? It is surprisingly complex and provides no benefit. Instead, why not do all the calculations at this stage and make the final output simpler and faster?

If the root expression is a number then we pass that number to printf. If it’s a binary expression, we can have some fun.

First stop is a generic function called compNode where we determine what kind of object we have: a BasicLit or a BinaryExpr.

If it is a basic literal, which we know is an integer, we simply convert it to an actual integer and return the result.

The binary expression is just as easy. The first element of the list of expressions is always the start point. Order of the operands is important for some operations like division and subtraction. The first element of the expression list is the starting point and is stored in a temporary variable (as an aside, this would be like using the eax or rax register in assembly).

Then, based on the operator, we compute the result and store it back into the same variable for each additional operand. Once complete, the result is returned. This is done recursively until everything is resolved.

Final Stage


Once the compiler is complete there’s still more work to be done. First off, we need to create a command to read in the Calc 1 source code and call the compiler library. Next, a C compiler (likely gcc or clang) should be run on the output from the compiler command. Finally, depending on how the C compiler was run, you might need to run a linker on the object code output from the C compiler.

This is all handled by a program called calcc, which stands for Calc Compiler. Another example of my brilliant naming skills.

I don't think there's anything spectacular going on in this program. It merely opens the input file, verifying that it has the .calc extension, and calls comp.CompileFile on it. It then uses exec.Command from the Go standard library to execute the C compiler and linker.

It has several flags which can be used to control the C compiler a bit.

Last Words


I sincerely hope I've helped someone learn a little something about compiling. It is a massive subject and it can be pretty overwhelming.

For some, this series is probably a little too simple and for that I’m sorry. I hope you can bare with me when I take things to the next level in Calc 2.

There is a tremendous amount I've skipped over and I hope to continue in the next series to start addressing these shortcomings. Topics I hope to include in Calc 2 are:

  • symbol tables 
  • scope 
  • function declarations 
  • variable declarations 
  • comparisons and branching
  • variable assignment
  • typing 
  • stack memory

I will be basing Calc 2 directly on the code from Calc 1 so anything you've learned here will, I think, be directly applicable in the next series.

If a Calc 3 comes to fruition, and I hope it does, I certainly plan to tackle assembly. If I do, then I might include a few articles solely on assembly itself, possibly as a separate series prior to Calc 3. Other possible ideas are: objects, methods, multiple-assignment, multiple files and libraries.

Thanks for reading!

I bid thee, adieu!

Compiler Part 9: Parsing

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree

We've come a long way so far. We've covered the basic principles of scanning and the abstract data tree. Now, we can finally move on to parsing.

If you’re struggling to keep up at the point then I’ll warn you that it just gets harder from here on out. Parsing is probably the hardest part to wrap your head around conceptually. We will be taking the lexemes our scanner detects, give them meaning and storing the resulting objects into the AST.

Make sure you understand the previous material before moving on.

The Parser Object


The parser for our little language is pretty simple. Like our scanner, we have a pointer to the information on the file we’re scanning. We have a way to track errors and a scanner object.

The last three fields directly reflect the same three values returned by the scanner: the position, the token and the literal string.

Entry Point


ParseFile is where the magic begins. We initialize the parser and begin scanning. If one or more errors are encountered we stop and print out the errors. Otherwise, we return the file object, which is the entry point into the program.

Just like with the scanner, init primes the pump to get things started. Any calls to any other parser functions will advance the parser and scanner and we can’t go back. Fuel is in the line so lets start the engine!

parseFile is a companion to it’s exported cousin. This function creates the first object for our AST. Going back to Part 5 yet again (see how important grammar rules are?) we can see that a file object is the root of everything. That’s exactly what we've done. The very first object is the File object.

Note that we don’t really check anything before proceeding to find the first expression. That’s because our grammar rules tell us this has to be true. Anything else is wrong. We expect to find an expression of some kind and we’ll be mighty upset if we don’t get one so we’ll just plow on ahead!

Last, we ensure that the very last thing in our file is the End of File (EOF) token. Our grammar rules show that after we find our root expression there should be nothing else. Report an error if there if something else is found and parade it around for everyone to look at and ridicule!

Expressions


We've discussed what an expression is a few times. Now, our job is to find one with parseGenExpr but we don’t know what kind to expect at first. Well, the first token we find is going to tell us all we need to know. If we find an opening bracket then we have a binary expression. If we find an integer, that’s what we have. Otherwise, we generate an error and move on.

The integer is the easiest element to parse. There’s no point into going into detail on it. The code is self-explanatory.

However, the binary expression is a bit trickier. Calc 1 only has binary expressions but later on we’re going to have more types of expressions. Just like with the Expression object we should have a generic way of handling expressions in general before dealing with the specifics. parseExpr handles this.

First off, we expect to find a left paren. It’s bad if we don’t. Next off, we need to figure out what kind of expression we have. We know that the only expression we have is a binary expression so we’ll test to find out what operator is the next token. If we get something else, we need to report an error.

The BinaryExpr exemplifies the recursive nature of our parser. We've already found the opening paren and operator so now we need to find our operands. This is done via a loop which recursively calls parseGenExpr. We keep building up our tree deeper and deeper until we reach the end.

After we find all the operands we expect a right paren to close the expression. Finally, we return the resulting BinaryExpr object and it gets inserted into our tree.

Expect, Next and AddError


Expect is a great little utility function. We tell it what token we expect to find. If it’s there, great. If not, we report an error. Regardless, we return the position of that element.

Next doesn't really need explanation. It just takes exactly what the scanner finds and stores it in the parser object.

AddError appends a new error message to our ErrorList. If more than ten errors have been found there’s really no point in continuing. It prints the errors and forces the parser to exit with an error code.

Syntax Analysis


If it isn't clear, our parsing stage does syntax analysis as it works. It ensures that the source code we’re parsing adheres to our grammar rules.

Anything less is unacceptable. No, seriously.

Finished


That’s it for this stage. If you got through this then the final bits should be easy. The parser will get more and more complicated as our grammar gets more in-depth. Calc 2 ought to serve as a good example of that.

Compiler Part 8 - Abstract Syntax Tree

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning

Before we can build the parser we need to take a detour and talk about how we’re going to handle the data we’re going to process.

We need to use some kind of abstract data type to hold all the data we’re parsing. As it turns out, a tree data structure is perfect for our needs. This tree describes the syntax of our programming language and is appropriately called the Abstract Syntax Tree (AST).

AST


Tree data structures always start with a root and ours is no different. Typically, in a full-fledged compiler, you’d probably have an object which represents a package or program. In our case, we only have one file so we’ll have an object called File.

The rest of the objects we need to create can be found in our grammar blueprint. I refer you, again, back to Part 5 where we created our grammar rules. This handy layout tells us exactly what we need. Aren’t you glad we created it? It’s also going to come in very handy when we start parsing and syntax analysis.

The other objects we’ll need to create: an expression and an integer.

Go Interfaces


Before we create our objects we need to consider what they’re going to look like. What is it we know?

We know that our scanner is going to provide us with three things:

  • The literal string of the element we've scanned. A number might be “42”.
  • The token representing the element. An opening parenthesis would be token.LPAREN.
  • The position at which the element starts. A number between the file’s base and base + size.

One thing that will be common to all objects in our tree is their position. Where do they start and where do they end? Error reporting, for example, won’t care what kind of object it has. It only cares what it’s position is. So, we create a Node interface which has two methods: Pos, for the start position, and End, for the end position.

Next, there is a recurring element in our grammar rules: expressions. An expression can either be an integer or a sequence of elements encapsulated in parentheses. The interface will be called an Expr. Any Expr objects must implement the method exprNode as well as fulfill the Node interface.

We can now pass our objects around between different parts of our compiler based on their interface and use type assertions to discover their exact type, if required. When we get to parsing, this ought become more clear.

Objects


There are three we have to implement as well as one extra helper object:

  1. Integer
  2. Expression
  3. File

We’ll start with the Integer. We have to be a bit forward thinking with our objects and the integer type is no different. We could spend a lot of time creating an object for each type in our language. We could, however, just create a single object to hold any basic type.

BasicLit is just such an object. Whether we have an integer, float, string, or character it doesn't matter. This object satisfies our needs for all of them. Of course, we only have integers at the moment but eventually we want to add more. This object holds the token telling us what type of object it is, the position where the literal starts and the string representing the literal.

Skipping over the Expression for a moment, we’ll move on to File. This object is even simpler. This is where it all starts. It’s our starting point and has one field, Root. Looking back at our grammar rules we can see that a file holds an expression and that expression is the root of everything else.

Our expressions are of a specific type right now. They’re binary expressions. Mathematics. We have several elements we have to keep track of: An operator, opening and closing parentheses and two or more operands which are, in of themselves, expressions.

A binary expression won’t always be our only type of expression. Eventually, we’ll want more. One thing that will be common to every expression in our language is an opening and closing paren. That’s where the generic Expression object comes in handy. It’s only job is to handle the parens. This object will be nested into all our sub-expressions.

The beauty of Go is that any objects with an embedded  type sort of “inherit” the nested object’s methods. It’s not true inheritance but rather the compiler being clever. Regardless, the result is the same for our purposes.

As already stated, the BinaryExpr must have an operator and a list of operands. We have fields that hold the operand and where it is located is the sources. Our operands are held in the field List which is a slice of objects which fulfill the Expr interface.

AST Utilities


If we need to inspect our AST, a method of walking the tree would be handy. The simple Walk function does just that.

Printing the AST is done via the Print method which merely Walks the tree and prints out some information about each node in order.

Moving On


We’ll finally get to parsing soon but I want to be sure the code for the AST is clear. There’s that nasty recursive definition of an Expression and likely some methods that don’t provide any context to their use. Make sure you review this article and any previous posts before moving on.

Thankfully, parsing ought to make things more clear how all the pieces fit together.

Compiler Part 7 - Scanning

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens

Now, we can finally work on the scanner.

Lexical Analysis


So, where do we start?

That was the hard part. Scanning, to me, seemed like it should be easy but I quickly got lost in the details. There are many ways to implement a scanner and I’ll show you how to do one type. Rob Pike did a great presentation on another cool method in a talk called: Lexical Scanning in Go.

The basic concept of this scanner is that we start at the top, moving left to right, until we reach the end of the source code. Each time we find an element of interest we’ll report the literal string found, a token telling the parser what it is and the position at which it was found.

Finite State Machine

Now, I’m not going to go into any real detail about things like a Finite State Machine (Finite Automata) or anything. You should investigate this yourself. Coursera has a course on compiler design you may find of help, too, that covers this topic. The concepts are important but its not strictly necessary to know every little detail (though I do encourage you to learn them).

The basic idea here is that we have a finite number of states which our scanner can return. These states are represented by tokens and since we can only return the limited tokens we’ve defined, we can say that our scanner has finite state. Hence, finite state machine. Understanding automata comes in handy when understanding regular expressions and/or when accepting or rejecting an individual character being scanned.

This will all become clear shortly.

Oops


I’d like to make it very clear one of the mistakes I made on my first attempt at writing a scanner. Writing any part of the compiler without some kind of solid definition for your language is a terrible idea. If the core design of your language is still fluid then you’re going to be rewriting your compiler a lot. And I mean, a lot. I had to rewrite my first attempt at an interpreter several times, pretty much from scratch, each time I revised my language. A complete waste of time.

It was this process that finally made me realize how bad decisions are made. What at first seems like a good idea might turn out to be a poor idea later but not making a decision at all will end up being disastrous. Many times I've criticized a language’s design asking myself, “Why in heck would you have done THAT? It’s so stupid!” Hindsight is 20/20 my friends.

The Scanner

The scanner is fairly simple. We start with a simple object which tracks things like the current character scanned, the offset from the beginning of the file, something we’ll call the reading offset, the source code being scanned and a pointer to the details about the file itself.

The first step to scanning is to initialize the scanner via Init. There’s nothing very special here outside the call to the next method, which I call “priming the pump.”

The next method is a funny little function. It first resets the current character to zero to indicate the end of the file. If the reading offset is less than the length of the file then the offset is changed to the reading offset. If a newline is encountered we make note of it’s location in the file object. Note that we discard the newline but still note its location. Finally, we update the current character and increment the reading offset.

Reading Offset and Unicode

What’s the deal with it? It mainly has to do with Unicode. A single character may take one or more bytes so you can’t just increment the offset by one each time. The UTF8 package’s DecodeRune function returns the number of bytes in the next character. The reading offset is used in that situation to mark the start of the next rune to read.

While this scanner won’t be Unicode friendly we can still start incorporating some of the functions we’ll end up needing so that we’ll have less work to do later when we add it. We’ll also be using the IsDigit and IsSpace functions from the unicode package.

Scan


This puppy is the meat and potatoes of the scanner. The Scan method starts by skipping any whitespace. The method skipWhitespace simply advances the scanner one character at a time until it reaches a non-whitespace character. I've utilized the unicode.IsSpace function to achieve this goal.

Next, it looks for multi-character elements. In this case, we are just looking for numbers. After that, we look for single character elements and finally wrap things up by reporting either an illegal character or the end of the file.

After each pass we always need to increment the scanner to the next character via a call to next and return the results of the scan.

We should also have our handy language spec at our sides. It tells us exactly what to do. Head back to Part 5 if you need to find it again.

Integers


If we encounter a digit, we scan for a longer sequence via scanNumber.

I've chosen to use the unicode.IsDigit function but we could have just as easily written our own implementation. Something as simple as: return s.ch >= ‘0’ && s.ch <= ‘9’ would have sufficed. scanNumber works by continually advancing the scanner until a non-digit is found. The if statement after the loop handles the instance that a number occurs at the end of a file.

In a later version of Calc, or on your own, I will expand this function to include various other numerical forms like floating point and hexadecimal.

More Scan


If a digit was not found, we move on and check for single characters. These should all be pretty self-explanatory with maybe the exception of the semi-colon and default.

Comments start at a semi-colon and go to the end of the line or end of the file. In the current incarnation of the scanner comments are discarded but it would be just as simple to keep them. For instance, the Go scanner reports any comments it finds and passes them along to the parser so that the parser can create those wonderful Go docs we all know and love!

Wrap Up


That’s all there is to the scanner. It’s fairly straight forward.

On to parsing!

[Edit: Add link to Rob Pike's talk on Lexical Scanning in Go]

Compiler Part 6: Tokens

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification

In this post we can finally start to lay down some code!

Tokens


Previously, we discussed our grammar and the set of tokens we’ll need to scan for. We defined an expression, a number and operators. We also identified we expect to encounter some parentheses. We should also let our parser know when the scanner has reached the end of the file.

Before we can start scanning we need to formalize our code for the tokens our scanner will be emitting. Our tokens are going to be used in almost all stages of our compiler. If we want to create tools like Go’s fmt or vet programs we might need to reuse these tokens.

Here’s the first section of code: https://github.com/rthornton128/calc/blob/calc1/token/token.go

The constants might look a bit funny at first. There are some lower-case, non-exported tokens intermixed with upper-case, exported ones. The non-exported tokens will help us with writing some utility functions and allow the language to expand without modifying any other code.

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L36

Next comes a mapping of Tokens to strings. I could also have used an array of strings but I didn't. It makes writing the Lookup function easy.

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L50

The rest are some utility functions. You can see with IsLiteral and IsOperator where our non-exported constants come in handy. No matter how many new operators or literals we add we never have to adjust these functions. Handy!

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L58

Lookup, String and Valid help us when generating error messages.

Positions


This file might take a moment to wrap your head around. I’ll try and take it slow.

When scanning, we start at the first character in the stream, starting at the top and working down, moving left to right. The first character is at offset zero.

By comparison, when reporting an error a user wants to know the line and column at which the error occurs. The first character would be line one, column one. We therefore have to translate the position of the character offset into something meaningful to the end user.

A position, Pos, is the character offset plus the file’s base. With a base of one, a character at offset zero has a Pos of one.

It is illegal to have a position of zero because it would be outside of the file. Consequently, it would also be illegal to have a position greater than the base of the file plus it’s length.

Why something so complicated you might be wondering? Well, once we need to parse multiple files it can be quite confusing to identify in which file an error occurred without a lot of overhead. The Pos makes it easy. More on this in the next series.

The Position type is strictly for error reporting. It allows us to output clear information about which line and column the error happened at and in which file it occurred. At this stage we’ll only ever be dealing with a single file but we’ll be grateful for this code later.

Files


File is, strictly speaking, completely unnecessary for writing a compiler but I feel that good, clear errors messages are vitally important. Go does a pretty good job but many compilers don’t. The GNU C compiler, for instance, used to be horrid. It has improved a lot over the years.

It provides a framework for something yet to come. This code is really just laying the groundwork for when we have to process more than a single file.

It’s sole purpose is essentially for error reporting and goes hand-in hand with the position code. Again, since we only have one file the base, or start, position is one. It can’t ever be less but in the future it could be higher. Don’t worry about that now, though.

Each time the scanner detects a newline character, we want to add that position to the list of newlines in the file. This allows the Position function to accurately calculate where an error occurred and to report it’s position.

Summary


That about covers the token section. As promised, I've not talked much about the support code.

I suggest referring back to this code often. Once you begin to understand how it all works together it will make a lot more sense. This library gets used pretty liberally throughout the compiler so we'll be coming back to it often.

Compiler Part 5: Language Specification

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design

It's finally time to cover the design specification for Calc!

Defining the Language


I want to keep the language as simple as possible. I called the language Calc. You know, for calculator. Clever right? Right? Ok...uhm, moving on!

I also want a single, basic type. I decided to go with something equally as clever as the language’s name and called it an Integer. Brilliant, I know. Your admiration is duly noted. I decided to avoid dealing with floating point numbers for simplicity's sake as well as octal, binary, hexadecimal, and scientific notations. I leave it as an exercise for yourself to add these other notations in.

We’ll also need to specify the end of a file, a comment and mathematical operators.

It will also be essential to know operator precedence and encapsulation when we start to parse and evaluate our language. Call it a blessing or a curse but I decided at this time that I didn't want to mess with operator precedence. So, I've decided to use prefix notation and Lisp style expressions. This means that the operator will precede the arguments and be encapsulated in a left and right parenthesis. An example of adding two and three would look like:

Example 1 (simple expression)

(+ 2 3)
Result: 5

This has two advantages: 1) operator precedence becomes explicit; and, 2) we can perform an operation on an arbitrary number of values.

Example 2 (operator precedence):

(* (+ 3 1) (/ 4 2))
Result: 8


There’s no question here what operations are performed in which order. The addition and division operators are calculated independently, prior to the multiplication. No BEDMAS for you!

Example 3 (multiple values):

(* 1 2 3 4 5)
Result: 120


In this case, we process each element of the expression, moving from left to right. So we could rewrite this function in the equivalent following forms:

(*5(*4(*3(*2 1)))) == ((((1*2)*3)*4)*5) == 1*2*3*4*5

So, we need to define tokens for the left and right (open and closing) parentheses, and each mathematical operation we want to perform.

Last, we ought to throw in the ability to make comments. We’ll stick to single line comments but multi-line isn't that hard to add. Let’s use a single semi-colon to denote a comment.

The Tokens We Need

  • Left Parenthesis 
  • Right Parenthesis 
  • Addition Operator 
  • Subtraction Operator 
  • Multiplication Operator 
  • Division Operator (Quotient) 
  • Modulus Operator (Remainder) 
  • Integer 
  • Semi-Colon (Comment)
That’s everything we need to start lexing. The diligent will note that I have not defined tokens for any keywords, built-in functions or variables. That’s because our simple language doesn’t have any.

The point of this initial series isn’t to teach you to design a turing complete, fully functional programming language. The sole purpose is to give you the basic techniques to write an initial, working compiler.

I am currently writing a follow up series discussing how to add these features.

Whitespace


The next thing we have to decide on is how to handle whitespace. In languages like Lisp or Go, whitespace is mainly ignored. I say mainly because it’s not completely ignored. Strings and comments don’t ignore whitespace. It is also used to separate elements like numbers. However, it is ignored in the sense that is is not important to the syntax or semantics of the language.

Whitespace, incidentally, is typically a space, tab or newline. On Windows, you’ll want to toss carriage returns in with that lot, too. Basically, any non-visible characters you can type on a standard keyboard.

Now, your language could take whitespace into consideration. In Python, for example, it is very important, where indentation indicates statement blocks. Calc will not.

Syntax


We've outlined the elements our code will contain. Now we need to provide some meaning to it.

We are largely concerned with expressions in our language. We make no statements (terrible joke, I hope at least one person gets it). We want to express mathematical equations.

Each expression must be encapsulated in brackets, begin with an operator and must contain two or more operands. An example:

“(“ operator operand1 operand2 [… [operandN]] “)”

I've put the parentheses in quotations to signify they are a literal, required element which do not have extended grammar rules. The operator and first two operands are also required elements. The square brackets signify the additional operands are optional.

What can our operators be? They may only be one of the following symbols: + - * / %

How about our operands? Well, now things get a little trickier but I think we can handle it. Our operands can either be number or another expression. This allows us to nest expressions for making complex calculations.

I’d like to express our syntax in another form. It’s a grammar for defining something called a context free grammar. This particular one is a variant on Backus-Naur Form (BNF) with some regular expressions thrown in. Using this notation, we can express our language like this:

file ::= expr
expr ::= “(“ operator whitespace expr (whitespace expr)+ ”)” | integer
operator ::= [+-*/%]
whitespace ::= [ \t\n\r]+
integer ::= digit+
digit ::= [0-9]


I am no expert on context free grammars so hopefully I haven’t cocked it up. As you can see, it represents the requirements listed above quite succinctly and will make designing our parser much easier than without it.

Final Notes


Our language is quite simple but the grammar helps to make it clear what we expect and will make writing our scanner and parser much easier.

Looking at the integer definition, it’s quite clear what we need to scan for. We will need to find one or more digits between zero and nine. The same is true of white space where we need one or more spaces, tabs or newlines.

When parsing we know exactly what to expect. Our program consists of a single expression. Each expression is enclosed in brackets and contains exactly one operator which must be the second element of the expression. The operands may either be an expression or a number. We must have at least two operands.

Compiler Part 4: Language Design

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling

In part one we had a quick introduction to what this series is going to be about. In parts two and three I gave a very brief, non-exhaustive overview of the steps involved in compiling a computer programming language.

In this post, I’m going to be a bit more specific rather than the broad overviews in the last few articles. I’m going to discuss the language specification for our language.

Language Design Overview


Did I not just finish saying I was going to stop with the overly broad overviews? Turns out, I’m a dirty, rotten liar.

I’ll try and be as brief as possible so we can get into laying down some code but I want it to be clear that this is a very incomplete view of language design. Computer language design is a topic you can devote your life to. In no way am I going to give you any theory behind the design choices in Calc. I’m not going to discuss much at all.

In fact, I’m basically going to just tell you what the language specification is.

I want to be completely frank that Calc 1 isn't even really a programming language. It’s sole purpose is to teach the basics of building a compiler and act as a spring-board for future specifications of Calc to be covered in a future blog series.

Language design is hard. Very hard. It’s easy to say: “I want this feature, that feature and those features.” Why? For what purpose? How will it complicate the design? Could it confuse users? Is it necessary? Who is your target audience and would they want or need the feature?

Loops


These are not questions that are easy to answer even with tremendous amounts of experience. Some features, like looping constructs, might appear like a no-brainer but are they? Let’s take the two most common loop constructs: for and while.

Almost every computer language ever designed has some method of looping, whether it’s a jump like a goto statement or the very common for and while loops. Recursion is another form of looping. The most popular tend to be for and while.

First question, first: Why?

Well, we want to repeat an action. OK, pretty easy.

For and while both require an exit condition to complete. We usually associate these conditions with the logical true and false. Both have slightly different signatures:

while(condition) do { action }

for (start; condition; increment) do { action }


Why? Well, originally, I think it was logical to think of these loops as different. The Go developers, however, saw some commonality between the two constructs and asked themselves this very question: "Why?"

Why have two keywords, two reserved names, to essentially do the same thing? It might be harder to parse but wouldn’t it be nice for the end user if they just had to use one looping construct and only a single, reserved keyword instead of two?

The Go developers chose the shorter for keyword. I think it’s a more intuitive word which likely had bearing on their decision, too. ‘for (condition) do { action }’ makes sense. Suddenly, our language is simpler and cleaner by challenging a simple premise we’ve long taken for granted. Why use two keywords when one will suffice?

Trade Offs


There’s always a trade off. Complexity for simplicity. Speed for convenience.

The cost of the for loop decision was some mild complexity in parsing. The developers made a conscious choice to make parsing the for statement harder in order to provide a vast improvement in productivity for the programmer.

Generics is a hot topic. Go doesn’t provide any. Why?

Generics are all about trade offs and I can’t even begin to explain them to you. Your best bet is to do some searches in the golang mailing list and look at the FAQ.

The end result is that you need to decide what you want. Do you want a lean and mean language or a complex one that does everything under the sun? Why did you make the choices you did?

Just because a feature is in every other language doesn’t mean it needs to be. The while loop is a perfect example of that.

A switch, in Go, is so much more than it is in languages like C. It’s a simple concept made vastly more powerful without adding much cognitive complexity for the programmer.

Type System


Dynamic or Static? Strong or Weak? Explicit or implicit?

I, personally, prefer a strong, statically typed language. I like to be sure that what I expect is what I get without having to do anything more.

Can, or should, a type be castable? Can a floating point number be an integer? Could a character be a float? What is a string?

Bounds checking on arrays? Slices?

Pointers and addressable variables? Variables passed by reference or by value?

Like I said, something seemingly simple becomes very, very complex quickly.

Summary


I have new-found respect for many languages. I’ll be honest. I don’t like Java. I recognise its strengths but writing code in the language feels like programming with a donkey. That said, I have to respect the designers. I have to respect what they’ve done and achieved. I have to respect how what perhaps was the best, or only, decision back then is a bad idea now. Java does a job and millions of programmers use it. You have to respect that.

You will make bad design choices, no matter how clever you think you are. I’ve followed the Go mailing lists since early 2011 when I first became aware of the language and I’ve seen how the language evolved until it was finalized in its 1.0 release. There have been, and continues to be, many arguments over the design of the language. Language design is HARD and your decisions won’t please everyone.

As I said earlier, language design is a huge, huge topic. Many languages, I think, fall victim to wanting to please everyone and anyone. They have toolboxes full to overflowing. Other languages draw a line and refuse to cross it, even in the face over overwhelming evidence of its necessity. Both types of governance piss people off.

You need to choose which camp you want to be in and why.

Next up: The Calc1 Specification

Compiler Part 3 - Compiler Design Overview

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters

What Are We Writing?


A calculator. Sort of. A compiler for a super simple mathematical language.

We are, at least for now, going to avoid the complications of dealing with strings and characters and focus on numbers. Not just numbers, whole numbers. Integers to be specific. This is the only “type” we’ll have in our system, for now.

Adding new types isn't exactly hard, and indeed is desirable, but makes our design much more complicated than need-be at this stage in the game.

Stages of Compilation


Whether you are building an interpreter or a compiler most of the steps remain the same. The most common, basic steps are:
  1. Lexical Analysis
  2. Parsing
  3. Semantic Analysis
  4. Optimization
  5. Code Generation
We’ll attack each of these steps one at a time.

Overview: Lexical Analysis


We've all written code before. It’s a “human readable” representation of what we want our programs to do. We need to take something that is human readable and make it something a computer can understand. The first step, is lexical analysis.

The short of it is, we have to scan through the text and report what was found. That is, we have to pick out all the different unique parts of our language and assign an identifier to them called a token. Each token is associated with a lexeme, the literal string of characters, for handling in the next stage. Literals (numbers, strings, etc), keywords and operators are examples of the kind of things we identify in this step.

For error reporting, we should also provide the position at which the token was found.

Overview: Parsing


In this stage, also called syntactic analysis, we start to give meaning to the tokens we’ve found. Each token is represented by an object and placed into a tree data structure.

The language syntax is verified at this stage. We want to ensure that the order of what we’ve received is correct. In LISP, an expression takes the form of an opening bracket, an operator or function, followed by any number of arguments and a closing bracket. Parsing ensures that these requirements are met.

Overview: Semantic Analysis


Next, we check to ensure the semantics of our language are met. If a variable has been declared as an integer but assigned a string, we have a problem. The number of arguments of a function call not matching the number of parameters of the function signature is another semantic error.

Semantic Analysis is a;sp concerned with things like variable scope and generates errors if a variable is used that hasn't been declared.

Overview: Optimization


This stage does what it says. It optimizes the output, usually for speed or size (memory consumption). I won’t be covering much of this in this series but an easy example would be assigning a variable the result of an expression rather than the expression itself.

For example, what if we assign the variable A the statement ‘2 + 3’. Adding two static numbers could require several steps in code generation. This can be optimized by pre-calculating the result and assigning it in the generated code, simplifying the statement to ‘A = 5’.

Overview: Code Generation


This is the final stage. It outputs the lower level code we want to emit. A Java compiler would output Java bytecode. A C compiler might output assembly. An assembler outputs machine code.

Moving On...


Bear in mind that there can be more or less steps in your compiler. A C compiler, for example, also has a pre-processing step. Some steps, like parsing and semantic analysis, can be wrapped together into a single step. There is a lot to compiler design. For a broader overview, check out this Wikipedia page: Compilers.

It would be great if we could dive into writing a compiler now. Unfortunately, that would be folly. Without knowing what our language requires we’d be at a loss as to what to actually write. Without a set of rules to follow we’d have no direction.

Next up: the language specification!

Compiler Part 2: Compiling, Transpiling and Interpreting

Part 1 served as an introduction to this series.

In this second post, I want to do an overview of some definitions before we dive in to the actual steps of compiling.


Compiling


Compiling is the act of taking code written in one language and translating it into a lower level language. A C compiler doesn't usually output machine code directly. Instead, it translates the C source code into assembly. The assembler takes that output and compiles it into machine code. C# and Java are translated into bytecode. Bytecode isn't transformed into machine code until it is executed by the virtual machine.

It’s important to understand this distinction.

Compiling often consists of using an intermediate representation (IR) or intermediate language. Assembly is a common intermediate language. LLVM has an IR imaginatively called LLVM IR. C can also act as an intermediate language.


Transpiling


Transpiling, by contrast, is transforming code from one language to another language of equal or higher level. Translating Go to Javascript is an example of this.

Do not confuse this with what languages like Scala or Clojure do. Both of these languages are compiled directly to Java bytecode and use the JVM to execute their code. Neither are considered to be transpiled because Java bytecode is a lower level of abstraction. Were they translated into Java first prior to being compile to bytecode then they might be considered transpiled languages.

Translating Go or Lisp to C isn't really transpiling either, though it rides a fine line. C is a lower level language although it is certainly a higher level language than assembly.


Interpreting


Interpreters are usually associated with scripting. Rather than translating a language into another language or IR the language is, typically, interpreted directly.

In some cases this means, quite literally, scanning the code and interpreting it on the fly and halting the moment something isn't right. In other cases, it means scanning through the entire code, validating it and then executing it from the internal tree data structure holding all the information.

In a sense, an interpreter of this kind must “compile” the script or source code each time it is run, making it much slower. This is because it needs to do all the steps a compiler does each time the script is executed rather than doing that process only once.

Many modern interpreters don’t typically do this, however. You may want to investigate Just In Time (JIT) compiling if you’re interested in the details. The short and simple version is that the script is interpreted only once, just like a compiler, and is stored in some kind of intermediate form so it can be loaded faster. This process also allows the designers to incorporate optimizations to further speed up code.

Binaries, Compiling


The objective of compilers is often to create an executable binary, or at least some kind of object  the machine can read and execute.

It’s not so simple to just create machine code. Compiling is actually just the first step.

Here are the basic steps that GCC and Go might use to compile their source code:

C using GCC:


  1. Translate C into assembly via GNU C Compiler (gcc);
  2. Translate assembly into machine code via GNU Assembler (gas);
  3. Link machine code with standard library and produce an architecture specific binary via the GNU Linker (ld).


Go using Go Compiler on x86-32:


  1. Translate Go into assembly via Go compiler (8g);
  2. Translate assembly into machine code via Plan 9 Assembler (8a);
  3. Link machine code with standard library and produce an architecture specific binary via the Plan 9 Linker (8l).

As you can see, the compiler is just one step of the process. In these articles we’ll only be dealing with step 1. Assembling and linking are well beyond the purpose of these articles.

Don’t fret though! Our compiler will produce an executable binary!


Next


Now that we've taken care of some definitions and understand the process to making executable code we can look at a compiler in a little more detail in Part 3.

Compiler Part 1: Introduction to Writing a Compiler in Pure Go

Introduction


I've long been interested in learning how a compiler works. Cryptic compiler messages and odd behaviours have always baffled me. I’ve never quite understood how optimizations are done or how it is a compiler knows what I've done wrong.

When I first decided it was time to learn how to write a compiler I found out that there are a lot of terms and acronyms not often used outside of the discipline. What was an SLR or LALR parser? What in the heck is a lexeme or finite automata? What is recursive-descent parsing? What is an AST?

It can be pretty overwhelming at first.

The largest hurdle I found was that most online courses, tutorials and other learning materials on the Internet don’t teach you how to write a compiler or interpreter. Correction; they don’t teach you how to build one from scratch but provide you an overarching, top-down approach, relying on existing tools to help you do the job. This is because building a compiler is a pretty big job and it would be nigh impossible to cover everything in a single semester. I can’t fault anyone for that.

That said, I felt these tools skipped over what I think are valuable lessons and information. I wanted more.

So, I want to take you on a tour of my personal journey through learning how to write a compiler.

Bare in mind, I’m not a professional programmer nor am I finished figuring everything out. That said, I've discovered so many wonderful things that I can’t help but want to share them in the hopes they’ll help you discover your own path. After all, it’s why you’re here, isn't it?

Credit Where Credit is Due


Calc is an original work and while all the code for this project has been hand written there can be no denying parallels and similarities to Go and Go parser package in the standard library. In learning to write a compiler I have often referenced those works when I got stuck. Any and all credit must go to the Go developers for any code that resembles any of their collective works. You can find the source code for the Go compiler at golang.org.

Prerequisites


As much as I’d like it to be, this series is not really intended for novice programmers. It assumes you have some programming experience and are already familiar with the Go programming language. Therefore, it is targeted at more of an intermediate programmer.

While I think a more novice programmer might be able to follow the articles I will make no attempt to explain Go or any code not directly related to compiling. There are several utility functions which I will point out but since they aren't explicitly about compiling I won’t explain them.


Tools


Compiler design is a component, or perhaps companion, to language design. It’s a massive topic and each step of the compiler is a lesson in of itself.

There are a couple tools out there that people use to simplify or expedite the process of building a compiler. Each tool tends to focus on one particular step of the process. I am telling you about these tools so that you know that they exist but they are not relevant to this blog series. There are, however, many articles and classes that will teach you compiler design by using these tools.

Flex and Bison are probably the most common open source options available. Flex is based on a program called Lex, short for Lexical analysis, which is the first step of compiling. Bison is based on a program called YACC, an acronym for Yet Another Compiler Compiler, which does parsing and builds something called an abstract syntax tree.

I don't expect you to understand any of that, so don't worry. I will caution you that trying to use either of those tools without comprehending how a compiler works will be difficult in the extreme. You should prepare for some hurt.

I have actually found it more useful to skip these tools and learn the basics, first. I think building the entire compiler stack by hand provides a better solution in the long term anyway but that’s just my opinion. Take it for what you will.

The problem, for me, is that there ends up being a very serious disconnect and loss of control between each stage of compiling. It also begs the question of: “how does it actually work?”

The Solution


That’s why in this series I’m going to cover each individual component and we’re going to write code from scratch. It might take longer but I think you’ll have a stronger understanding in the end. If, after we’re done, you decide you want to use the aforementioned tools you’ll be able to do so much easier and make a more informed decision.

Stay tuned for Part 2 to start getting our feet wet!

[Edit: Updated the tools section to be more clear in its intent]