
Saturday, 14 November 2015

Compiler 3 Part 8 - Finale

This will end up being the longest, and final, post in the series. So let's tuck in!

Testing and Everything Else

The scanner and most of the infrastructure stayed exactly as they were.

The parser and AST did see some changes. The parser, in particular, saw several bug fixes and simplifications. Some of the error checking that had been done during parsing was handed off to the semantic analyzer. Similarly, anything related to type checking could be removed.

Also, the AST-related scoping code could be simplified and confined to the parser since it is discarded during semantic analysis.

Detecting “true” and “false” was added for the boolean type.

With those changes out of the way, I can say that if there is one area that Calc has truly lacked, it was in testing.

A compiler project should, in my opinion at least, be concerned with correctness. Correctness and consistency. The IR was one step in shoring up Calc in this area and testing is the other. Calc has always incorporated at least a little testing but it has never been as comprehensive as it should be.

Go provides great testing facilities right out of the box. The cover tool has also proven helpful in sussing out corner cases that need testing.

I have, I think, made large strides in making Calc a much more stable compiler.

That isn’t to say that it’s perfect. I’m sure within 5 minutes of playing with it someone will discover some bugs. Hopefully, those who find them will take the time to submit an issue so I can create tests and fix them.

This is the value of having an open source project.

The Future...

Calc has continued to evolve along with my own skills and I hope that I’ve helped someone, in some way, to further their own growth. Whether it be teaching someone how to, or how NOT to code, the end result remains the same. Someone, somewhere, has benefitted from my efforts.

Now that I’ve had a chance to regroup from the herculean effort of writing and implementing the Calc2 spec and teaching series, I feel like I can continue to produce more content.

I do hope to continue talking about compiler construction. In the future, I hope that by using a smaller format, like this one, that I can produce more posts more quickly. I can certainly say that this third series has been much more enjoyable to write!

I have also been working on another side project to create a virtual machine with a complete software stack. It has an assembler, linker and virtual machine running its own bytecode. If there is interest, I’ll write some blog posts about it. You can find the project here.

I have concluded both previous series talking about future plans and I shall do no less today. In series two I had a list of things I wanted to implement and change. Let’s go through the list:

  • growable stack - well, this is null and void. Calc no longer manages its own stack.
  • simple garbage collector - I no longer plan on implementing a garbage collector myself. See below.
  • loops - not implemented
  • generated code optimizations - done! 
  • library/packages - not implemented
  • assembly - null and void. I don’t plan on generating assembly.
  • objects/structures - not implemented
Only one of the seven features I wanted to implement was actually completed. Of the remaining six features, three of them no longer make sense. The groundwork for the other three has been laid down. In particular, loops and structures only await language design since implementation should be almost trivial.

Incorporating the remaining three features, I now present you with an updated list:

  • #line directives for source-debug mapping - I think this is a very reasonable and easily obtainable goal. It will make debugging Calc with the GNU debugger easier.
  • logical && and || operators - are easily added with minimal work
  • importing libraries - thanks to the x/debug/elf package, importing libraries into code may not be unreasonable to achieve in the near future. The scope of adding imports is likely a series in and of itself.
  • garbage collection - Calc 2.x does not need garbage collection. It does not have pointers and it does no memory allocation. However, when it does, I will probably use the Boehm garbage collector rather than attempting to spin my own. Additional info can be found on Wikipedia. This feature will be far in the future so don't expect it any time soon.
  • structs, loops and arrays - with the new code generator, implementing structs and arrays on the back end ought to be trivial. Work on this has actually already begun and you can view some of the changes here. Sorry, the specification is not published publicly right now.
  • stack tracing - there isn’t much holding me back from implementing stack traces on crashes now that the new code generator is in place. Time and effort are all that remains.
And that, as they say, is that! Thank you for taking the time to read through this series! At some point I’d like to put everything into a PDF but that would be a large task since large parts of both series would need to be re-written entirely. In fact, I might even want to start again from scratch.


Until next time!

Friday, 13 November 2015

Compiler 3 Part 7 - Code Generation

Here we are at the final step.

As mentioned in the introduction, I cut out a massive amount of code by offloading much of the work on the IR.

So here is what has changed:

I introduced a small function to map Calc types to C types. This could probably exist in ir.Types, too, but I chose to keep it coupled with the code generator since that’s the only code that uses it.

Any object with an ID has its original name stripped and replaced by a generic variable name. These names start with an underscore and the lower-case letter “v” followed by the ID of the object.

Each binary operation is assigned to a new variable (one reason why C99 is required). Don’t worry about this being wasteful. Even if you choose to output a chain of infix arithmetic (1 + 2 + … + N), the underlying assembly instructions usually take no more than two operands anyway. This is why using the above method works so well: it more closely matches the machine code.
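
To make that concrete, here is a minimal sketch of the idea in Go. The names (object, cName, cType, emitBinary) and the exact C types are my own illustration of the approach described above, not necessarily Calc’s actual code:

package codegen

import (
	"fmt"
	"io"
)

// object is a stand-in for an IR node that carries a unique ID and a type.
type object struct {
	ID   int
	Type string // "int" or "bool" in Calc terms
}

// cName builds the generic C variable name: underscore, "v", then the ID.
func cName(o object) string { return fmt.Sprintf("_v%d", o.ID) }

// cType maps a Calc type to a C equivalent (an assumption for this sketch).
func cType(t string) string {
	switch t {
	case "bool":
		return "int" // C99 has _Bool, but int keeps the output simple
	default:
		return "int32_t"
	}
}

// emitBinary writes one binary operation into a fresh C variable, e.g.
//   int32_t _v3 = _v1 + _v2;
func emitBinary(w io.Writer, result, lhs, rhs object, op string) {
	fmt.Fprintf(w, "%s %s = %s %s %s;\n",
		cType(result.Type), cName(result), cName(lhs), op, cName(rhs))
}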

Note: I've spent a lot of time comparing the assembly generated from C to see how things work. This is a fun (am I sick?) and interesting exercise to try yourself.
Even if you didn't follow along with the last series, I encourage you to view the previous binary code generation function. 58 lines of confusing mess now pared down to a simple 5 line function. Perhaps more importantly, the generated code is easier to read and follow even though it’s not really intended for visual parsing.

Another important change is using C-style function calling and function declarations. This was made possible by the new IR and type system. With every object being assigned a valid type, we can easily create proper C function prototypes and definitions.

By utilizing the SSA-like code generation and the new IR it also becomes trivial to use C-style calling convention. Types have already been checked, the number of arguments verified, and sub-expressions have been assigned ID’s. Therefore, only raw values and object IDs are passed into the function.

All in all, fairly simple.

Tuesday, 10 November 2015

Compiler 3 Part 6 - Tagging

This step is crucial to Calc’s code generator but may not exist at all in other compilers. Regardless, it makes code generation for Calc dead simple. Before I can get into the process, you do need a little background information.

In the introduction I mentioned something about 3AC and SSA. Make sure you check out the articles but the cut and dry is thus:

In SSA, each calculation is given a unique ID. These IDs replace variable names and other objects.

So, what does this mean to us?

Consider the following two example infix arithmetic operations:


  1. a + b + c + d 
  2. a * (b + c) / d 

These two operations could be translated into something like the following:

Example 1: a + b + c + d
r1 := a + b
r2 := r1 + c
r3 := r2 + d


Example 2: a * (b + c) / d
r1 := b + c
r2 := a * r1
r3 := r2 / d


As you can see, the result of each binary operation is assigned to a new, unique variable. While verbose, it is much more akin to assembly and easy for C compilers to optimize.

It also ensures that calculations are done in the correct order and removes the necessity of pushing and popping operands to the stack.

Armed with this knowledge, we can now get on with what I call tagging.

Tagging is the process by which we attach these unique identifiers to each operation. Variables, parameters, binary operations and unary operations all need to be tagged.

Oddly, perhaps, even if statements need to get tagged and you may wonder why that is. Well, if statements in Calc are not statements. They’re expressions. Like if expressions in functional languages, if expressions in Calc return a value.

As part of escape analysis and type checking, the value of any branch in an if expression must be checked and tagged since its result may be used elsewhere in the code.

As you can see, folding constants can save us time by reducing the number of operations needing an ID.

We can traverse the tree in any manner we choose provided that every ID is unique.
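
Here is a rough sketch of what a tagging pass can look like, assuming a depth-first walk; the type names are stand-ins rather than Calc’s real IR:

package tag

// Object is a minimal stand-in for an IR node: it has an ID plus whatever
// children it owns.
type Object interface{ setID(int) }

type Binary struct {
	id       int
	Lhs, Rhs Object
}

func (b *Binary) setID(id int) { b.id = id }

type Constant struct{ id int }

func (c *Constant) setID(id int) { c.id = id }

// Tagger hands out unique, increasing IDs as it walks the tree.
type Tagger struct{ next int }

// Tag walks depth first so operands are tagged before the operation that
// uses them; any traversal order works as long as every ID is unique.
func (t *Tagger) Tag(o Object) {
	if b, ok := o.(*Binary); ok {
		t.Tag(b.Lhs)
		t.Tag(b.Rhs)
	}
	o.setID(t.next)
	t.next++
}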

One more stop left to go. Code generation!

Monday, 9 November 2015

Compiler 3 Part 5 - Constant Folding

The only optimization included in series 2 was constant folding. Unlike in the previous series, we’re now much better equipped to handle this optimization.

I feel that I should point out that most C compilers do this step, too. I was somewhat reluctant to keep it in at first. However, my reasoning for keeping it is two-fold: one, I think learning a bit about optimizations on an IR is a worthy lesson; and two, it can help make the next step a little quicker.

Once again, the IR comes to the rescue! In the first step of transforming the AST into the IR we created constants. These were objects representing actual values. This crucial step makes it much easier on us now.

Unary and Binary operations are ideal candidates for folding. We have already done the work of verifying types and correctness so we can ignore all that now. When we check if a value is a certain type we can be confident that information is correct.

Looking at the code you can see that it’s pretty simple. Traverse the tree depth first.  If both operands of a binary object, or the single operand of a unary object, are values of the correct type we can fold them together. We then return the result as a new constant value and replace the previous object with it.

Moving back up the tree we repeat the process until we exhaust all the foldable values.
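
Here is a small sketch of the idea, again with illustrative stand-ins for the real IR objects:

package fold

// Object is any IR node; Constant holds an actual integer value. Both are
// simplified stand-ins, not Calc's real types.
type Object interface{}

type Constant struct{ Value int64 }

type Binary struct {
	Op       rune // '+', '-', '*', '/'
	Lhs, Rhs Object
}

// Fold walks depth first and replaces a binary operation whose operands are
// both constants with a single new constant.
func Fold(o Object) Object {
	b, ok := o.(*Binary)
	if !ok {
		return o
	}
	b.Lhs, b.Rhs = Fold(b.Lhs), Fold(b.Rhs)
	l, lok := b.Lhs.(*Constant)
	r, rok := b.Rhs.(*Constant)
	if !lok || !rok {
		return b // nothing to fold at this node
	}
	switch b.Op {
	case '+':
		return &Constant{l.Value + r.Value}
	case '-':
		return &Constant{l.Value - r.Value}
	case '*':
		return &Constant{l.Value * r.Value}
	case '/':
		if r.Value != 0 {
			return &Constant{l.Value / r.Value} // leave division by zero for error reporting
		}
	}
	return b
}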

You can see the value in converting basic literals into constants when building the initial IR tree. It makes folding constants together much, much simpler.

Onward and upward!

Sunday, 8 November 2015

Compiler 3 Part 4 - Type Checking

Whether you choose to produce assembly or an intermediate language, the job of the IR is to take us ever closer to generating actual output. We need to do the crucial step of type checking. Since Calc’s type system is strong and static it is necessary to ensure type safety before moving on.

Each struct in the IR tree has an underlying object. This object provides a Pos() function that maps the remaining constructs back into the original source code. This will be useful to us when we are doing type checking so we can provide well formed error messages.

We must traverse the tree and verify that expected types match. Not all objects currently have a type, like variable declarations using type inference. This means that we don’t actually know the type of the variable when parsing the source language since the type isn't explicitly declared but inferred from the object being assigned to the variable. 

Similarly, binary expressions do not have a defined type when parsing. Your language may allow adding (concatenating) strings, in which case an add binary operation on strings would have the type string. In the case of Calc, logical comparison operations make a binary expression type bool. Don't forget that most languages have different integer types so your binary expressions will need to take on and verify these types.

This requires we traverse the tree depth first to identify the types of certain structures prior to verifying their correctness.
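
A sketch of what typing a binary expression can look like; the Type values and struct layout here are illustrative only:

package check

import "fmt"

type Type string

const (
	Int  Type = "int"
	Bool Type = "bool"
)

type Binary struct {
	Op       string // "+", "==", "<", ...
	Lhs, Rhs interface{ TypeOf() (Type, error) }
	typ      Type
}

func (b *Binary) TypeOf() (Type, error) {
	// Type the operands first (depth first), then decide this node's type.
	lt, err := b.Lhs.TypeOf()
	if err != nil {
		return "", err
	}
	rt, err := b.Rhs.TypeOf()
	if err != nil {
		return "", err
	}
	if lt != rt {
		return "", fmt.Errorf("type mismatch: %s vs %s", lt, rt)
	}
	switch b.Op {
	case "==", "!=", "<", "<=", ">", ">=":
		b.typ = Bool // comparisons always yield bool
	default:
		b.typ = lt // arithmetic takes the operand type
	}
	return b.typ, nil
}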

In addition to verifying type correctness, the type-checker also does some additional tests…
  • checks that function parameters match the number of arguments in function calls;
  • verifies that an identifier matches the right kind (variable vs. function names);
  • checks for undeclared variable and function names.

While not true of the current version of Calc, many languages have integer types of varying bit sizes. Somewhere during this process you will need to verify whether a particular value fits within the bounds of a particular type. If the source has a value greater than 255 being stored into an unsigned 8 bit type, this is a problem. When and how you choose to handle this error is up to you.

Type checking may also involve a level of escape analysis. You need to verify that any value being returned anywhere in the tree conforms to the parent type. In imperative languages, this means looking for statements like “return”. For functional languages, you need to search for terminal points like the "then" and "else" clauses in an "if" expression.

Once we are sure that everything is type safe and correct, we can move on.

Saturday, 7 November 2015

Compiler 3 Part 3 - Intermediate Representation

After presenting Calc2 I came to discover that many parts of the compiler were quite fragile. The type checking was particularly bad, occasionally reporting type errors (false positives) when run over the exact same source code multiple times. 

That’s a serious problem.

I had assumed that using an intermediate representation was strictly for doing optimizations. While they certainly help with, and may be crucial for, optimization they can be useful for much, much more.

The IR used by Calc is nothing special. In fact, it looks very much like the AST and the objects it uses share many of the same names. So what gives?

If you look more closely, there are some crucial differences. For one, each object is represented a little differently. Second, there is much more in common between each object. In fact, there is a struct embedded into every one: object.

This struct gives me access to almost everything needed to perform various tasks, including: type checking, error reporting and code generation.
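
A rough sketch of the embedding, with illustrative field names (the real Calc object differs in detail):

package ir

// Pos stands in for Calc's own token.Pos type.
type Pos int

// object carries what every later pass needs: a position for error messages,
// a type for the checker and an ID for the code generator.
type object struct {
	pos  Pos
	kind string // "int", "bool", ...
	id   int
}

func (o *object) Pos() Pos { return o.pos }
func (o *object) ID() int  { return o.id }

// Every IR node embeds object, so the common data comes along for free.
type Constant struct {
	object
	Int  int64 // the actual value rather than the literal string
	Bool bool
}

type Binary struct {
	object
	Op       string
	Lhs, Rhs interface{ Pos() Pos }
}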

While the AST and IR may both be trees, and represent much of the same data, they represent two very distinct things. The AST, as the name implies, represents the actual syntax of the language. Using it, you can somewhat faithfully recreate the original source code. The IR, on the other hand, sits much closer to code generation and thereby represents the code we wish to output.

I look at the IR like this: it acts as a bridge between (intermediate) the syntax and code (representation) generation.

The first step is transforming the AST into IR. This is done with a simple tree walking algorithm, converting each node into the new form. Typically, the compiler starts with MakePackage. From there, you can follow the calls along to see how the tree is built.

There are a few things worth noting.

First, notice that the parameters of functions get separated into objects stored in a function's scope and only a string slice of parameter names is stored with the declaration. This will become important in the next few steps.

Second, we can also begin a simple optimization of converting literals, like number and boolean strings, into actual values called constants. This saves us the step of doing these conversions later on and provides an opportunity to do constant folding in a much more efficient manner.
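
Continuing the sketch above, lowering a literal into a constant might look something like this (assuming the strconv and fmt packages are imported; the details are illustrative):

// makeConstant converts an AST literal such as "42" or "true" into an IR
// constant holding a real value, so later passes never re-parse the string.
func makeConstant(pos Pos, lit string) (*Constant, error) {
	switch lit {
	case "true", "false":
		return &Constant{object: object{pos: pos, kind: "bool"}, Bool: lit == "true"}, nil
	default:
		i, err := strconv.ParseInt(lit, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad integer literal %q", lit)
		}
		return &Constant{object: object{pos: pos, kind: "int"}, Int: i}, nil
	}
}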

Once we’re done converting the AST into the IR tree, the real work can begin. Stay tuned!

Friday, 6 November 2015

Compiler 3 Part 2 - C Runtime and Code Generation

Before diving into the real meat and potatoes of the changes that led to the runtime being (temporarily) removed, I thought I should discuss the motivations.

The runtime irked me greatly because I felt that I did not leverage the power of C nor the C compilers. There were many, many improvements needing to be made and as I began working on them I came to a realization.

The runtime, at least in its current form, is not very useful.

Unlike in a language like Go where the runtime serves a very real purpose (managing goroutines, the garbage collector, etc) the Calc C runtime merely tried to mimic what it’s like to program in assembly.

Of course, there are reasons to re-add a runtime further down the road. Runtime checks, like bounds checking, are one such reason. Stack tracing is another wonderful motivation to implement a runtime. Built-in functions are also common and useful.

I still maintain this was a worthy exercise; long-term, it just wasn't viable. The more I looked at the runtime the more I knew I could do better.

The largest hurdle was removing the stack. How was I to handle the call stack? As it turns out, it’s not too hard to use C-style function calls but it did require some changes that the series 2 code base just wasn't equipped to handle elegantly.

I also needed to push and pop data on and off the stack. So how could I rid myself of it?

I have been researching how to use C as an intermediate representation for many months. Generating C function calls and declarations proved difficult with the Calc2 compiler and that was one of the motivations for using an assembly-like approach.

Ultimately, it became more and more clear that I was missing a step that would make my life a lot easier.

That secret, if you will, was intermediate representation.

After much research I ran into some potential solutions: three address code and static single assignment. I have been aware of these languages and others like them (Gimple and RTL) for a while.

Using them seemed complicated and I didn't think they were worth the effort of learning about. I wasn't writing an optimizing compiler, so what advantages did they provide that I would need?

While not using any of these representations directly I do take advantage of what they provide. More on that next post.

Wednesday, 4 November 2015

Compiler 3 Part 1 - Introduction

I return to you once again with continued improvements and changes to the Calc programming language compiler. In the intervening months (almost a year!) since I completed the last series I have been busy with many other projects. However, I keep coming back to the compiler and in the last few months have been busy tinkering away.


I have taken a different route in this next iteration of the series and you could consider it a partial rewrite of the previous one. There are several knobbly sections in series two that I wasn't particularly happy with and in this third series I hope to address most, if not all, of them.


I also intend to make this third, and any potential future series, much shorter and narrower in scope. Please feel free to ask me questions as I will be glossing over some of the finer points.


So what has changed?


Well, the language itself has barely changed. The only new language feature is the boolean type. So, nothing too exciting there.


What did change was the entire back-end.


In series two, the code generator operated directly on the AST. It proved to be cumbersome and error prone. The compiler now uses an intermediate representation for the code generator that reduced the LOC from 544 to 255. Less than half the LOC while producing better, more easily optimized code.


That does not even include the removal of the runtime library!


Yes, the entire runtime has been removed which included many hundreds of lines of code being deleted. The Calc language went on a diet!


The net gain is a much better compiler and I want to share those efforts with you all.

Stay tuned!

Series 1 - Novice Compiler Design
Series 2 - Apprentice Compiler Design

Saturday, 3 May 2014

Compiler Part 10: Compiling to C

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree
Part 9: Parsing

The final stage at last!

With our language spec being so simple I could have easily skipped C and gone straight to assembly. I have two reasons for not doing so. First is portability. I don’t have to write any architecture-specific code in C for this tutorial. C has been ported to lots of different systems so I’ll let your C compiler do the work for me.

Second, assembly is going to be a lot more foreign to many programmers than C. Even if you've never written anything in C before it will be easier to understand than Assembly.

Compiling

This should be starting to look familiar. We’re going to be walking the AST and either generating C code or doing a calculation.

The output generated by our compiler will go into a file which shares the same name as the source file with a “.c” extension appended to it. CompileFile creates the output file, parses the source code and starts the output process.

As with the parser, the first element of the AST will be the File object. While the File object itself doesn't contain anything useful outside our root expression, it does provide us with the opportunity to produce some output.

This part actually isn't that exciting. It’s just a bunch of boilerplate. We provide an entry point for the program, an exit status, and a printf statement which requires we also include the C standard input/output header.

Optimization


I said at the beginning of this series that I wouldn’t be talking much about optimization, and I won’t. That said, our language is ideally suited to one type of optimization: pre-calculation.

We could go through the task of emitting code for each element of our tree but to what end? It is surprisingly complex and provides no benefit. Instead, why not do all the calculations at this stage and make the final output simpler and faster?

If the root expression is a number then we pass that number to printf. If it’s a binary expression, we can have some fun.

First stop is a generic function called compNode where we determine what kind of object we have: a BasicLit or a BinaryExpr.

If it is a basic literal, which we know is an integer, we simply convert it to an actual integer and return the result.

The binary expression is just as easy. Order of the operands is important for some operations, like division and subtraction, so the first element of the expression list is always the starting point. It is stored in a temporary variable (as an aside, this would be like using the eax or rax register in assembly).

Then, based on the operator, we compute the result and store it back into the same variable for each additional operand. Once complete, the result is returned. This is done recursively until everything is resolved.
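
Here is a sketch of the pre-calculation just described; the node types are simplified stand-ins for the Calc 1 AST:

package comp

import "strconv"

// BasicLit and BinaryExpr are simplified stand-ins for the Calc 1 AST nodes.
type BasicLit struct{ Lit string }

type BinaryExpr struct {
	Op   rune  // '+', '-', '*', '/' or '%'
	List []any // operands: *BasicLit or *BinaryExpr
}

// compNode evaluates the whole expression at compile time and returns the
// final integer, so the generated C only needs to printf a single constant.
func compNode(n any) int {
	switch e := n.(type) {
	case *BasicLit:
		v, _ := strconv.Atoi(e.Lit) // the scanner guarantees digits only
		return v
	case *BinaryExpr:
		// The first operand seeds the temporary; order matters for - and /.
		acc := compNode(e.List[0])
		for _, operand := range e.List[1:] {
			v := compNode(operand)
			switch e.Op {
			case '+':
				acc += v
			case '-':
				acc -= v
			case '*':
				acc *= v
			case '/':
				acc /= v
			case '%':
				acc %= v
			}
		}
		return acc
	}
	return 0
}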

Final Stage


Once the compiler is complete there’s still more work to be done. First off, we need to create a command to read in the Calc 1 source code and call the compiler library. Next, a C compiler (likely gcc or clang) should be run on the output from the compiler command. Finally, depending on how the C compiler was run, you might need to run a linker on the object code output from the C compiler.

This is all handled by a program called calcc, which stands for Calc Compiler. Another example of my brilliant naming skills.

I don't think there's anything spectacular going on in this program. It merely opens the input file, verifying that it has the .calc extension, and calls comp.CompileFile on it. It then uses exec.Command from the Go standard library to execute the C compiler and linker.
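
Invoking the C compiler might look roughly like the sketch below; the gcc name and flags are an assumption on my part, and calcc's flags let you adjust them:

package main

import (
	"log"
	"os/exec"
	"strings"
)

// buildC runs the system C compiler on the generated .c file and produces
// the final executable in one step (compile plus link).
func buildC(cFile string) error {
	out := strings.TrimSuffix(cFile, ".c")
	cmd := exec.Command("gcc", "-std=c99", "-o", out, cFile)
	b, err := cmd.CombinedOutput()
	if len(b) > 0 {
		log.Printf("%s", b) // pass the C compiler's messages along
	}
	return err
}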

It has several flags which can be used to control the C compiler a bit.

Last Words


I sincerely hope I've helped someone learn a little something about compiling. It is a massive subject and it can be pretty overwhelming.

For some, this series is probably a little too simple and for that I’m sorry. I hope you can bear with me when I take things to the next level in Calc 2.

There is a tremendous amount I've skipped over and I hope to continue in the next series to start addressing these shortcomings. Topics I hope to include in Calc 2 are:

  • symbol tables 
  • scope 
  • function declarations 
  • variable declarations 
  • comparisons and branching
  • variable assignment
  • typing 
  • stack memory

I will be basing Calc 2 directly on the code from Calc 1 so anything you've learned here will, I think, be directly applicable in the next series.

If a Calc 3 comes to fruition, and I hope it does, I certainly plan to tackle assembly. If I do, then I might include a few articles solely on assembly itself, possibly as a separate series prior to Calc 3. Other possible ideas are: objects, methods, multiple-assignment, multiple files and libraries.

Thanks for reading!

I bid thee, adieu!

Compiler Part 9: Parsing

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning
Part 8: Abstract Syntax Tree

We've come a long way so far. We've covered the basic principles of scanning and the abstract syntax tree. Now, we can finally move on to parsing.

If you’re struggling to keep up at this point then I’ll warn you that it just gets harder from here on out. Parsing is probably the hardest part to wrap your head around conceptually. We will be taking the lexemes our scanner detects, giving them meaning and storing the resulting objects in the AST.

Make sure you understand the previous material before moving on.

The Parser Object


The parser for our little language is pretty simple. Like our scanner, we have a pointer to the information on the file we’re scanning. We have a way to track errors and a scanner object.

The last three fields directly reflect the same three values returned by the scanner: the position, the token and the literal string.

Entry Point


ParseFile is where the magic begins. We initialize the parser and begin scanning. If one or more errors are encountered we stop and print out the errors. Otherwise, we return the file object, which is the entry point into the program.

Just like with the scanner, init primes the pump to get things started. Any calls to any other parser functions will advance the parser and scanner and we can’t go back. Fuel is in the line, so let’s start the engine!

parseFile is a companion to its exported cousin. This function creates the first object for our AST. Going back to Part 5 yet again (see how important grammar rules are?) we can see that a file object is the root of everything. That’s exactly what we've done. The very first object is the File object.

Note that we don’t really check anything before proceeding to find the first expression. That’s because our grammar rules tell us this has to be true. Anything else is wrong. We expect to find an expression of some kind and we’ll be mighty upset if we don’t get one so we’ll just plow on ahead!

Last, we ensure that the very last thing in our file is the End of File (EOF) token. Our grammar rules show that after we find our root expression there should be nothing else. Report an error if something else is found and parade it around for everyone to look at and ridicule!

Expressions


We've discussed what an expression is a few times. Now, our job is to find one with parseGenExpr but we don’t know what kind to expect at first. Well, the first token we find is going to tell us all we need to know. If we find an opening bracket then we have a binary expression. If we find an integer, that’s what we have. Otherwise, we generate an error and move on.

The integer is the easiest element to parse. There’s no point in going into detail on it. The code is self-explanatory.

However, the binary expression is a bit trickier. Calc 1 only has binary expressions but later on we’re going to have more types of expressions. Just like with the Expression object we should have a generic way of handling expressions in general before dealing with the specifics. parseExpr handles this.

First off, we expect to find a left paren. It’s bad if we don’t. Next off, we need to figure out what kind of expression we have. We know that the only expression we have is a binary expression so we’ll test to find out what operator is the next token. If we get something else, we need to report an error.

The BinaryExpr exemplifies the recursive nature of our parser. We've already found the opening paren and operator so now we need to find our operands. This is done via a loop which recursively calls parseGenExpr. We keep building up our tree deeper and deeper until we reach the end.

After we find all the operands we expect a right paren to close the expression. Finally, we return the resulting BinaryExpr object and it gets inserted into our tree.
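
Here is a condensed sketch of that recursive shape. Everything below is an illustrative stand-in for the real Calc 1 parser; names and details differ, but the flow is the one just described:

package parse

// Minimal stand-ins for the token and ast packages.
type (
	Pos   int
	Token int
)

const (
	EOF Token = iota
	LPAREN
	RPAREN
	INTEGER
	ADD // '+' ... the other operators are elided
)

type Expr interface{}

type BasicLit struct {
	LitPos Pos
	Lit    string
}

type BinaryExpr struct {
	OpPos    Pos
	Op       Token
	List     []Expr
	ClosePos Pos
}

type parser struct {
	pos Pos   // position of the current token
	tok Token // the current token
	lit string
}

func (p *parser) next()               { /* advance the scanner; elided */ }
func (p *parser) addError(msg string) { /* record an error; elided */ }

// expect reports an error if the current token is not t, then advances.
func (p *parser) expect(t Token) Pos {
	if p.tok != t {
		p.addError("unexpected token")
	}
	pos := p.pos
	p.next()
	return pos
}

// parseGenExpr decides what kind of expression starts at the current token.
func (p *parser) parseGenExpr() Expr {
	switch p.tok {
	case LPAREN:
		return p.parseBinaryExpr()
	case INTEGER:
		lit := &BasicLit{LitPos: p.pos, Lit: p.lit}
		p.next()
		return lit
	default:
		p.addError("expected '(' or integer")
		p.next()
		return nil
	}
}

// parseBinaryExpr consumes the '(' and the operator, then recursively
// collects operands until the closing ')'.
func (p *parser) parseBinaryExpr() *BinaryExpr {
	b := &BinaryExpr{}
	p.expect(LPAREN)
	b.OpPos, b.Op = p.pos, p.tok // the operator must come right after '('
	p.next()
	for p.tok != RPAREN && p.tok != EOF {
		b.List = append(b.List, p.parseGenExpr())
	}
	b.ClosePos = p.expect(RPAREN)
	return b
}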

Expect, Next and AddError


Expect is a great little utility function. We tell it what token we expect to find. If it’s there, great. If not, we report an error. Regardless, we return the position of that element.

Next doesn't really need explanation. It just takes exactly what the scanner finds and stores it in the parser object.

AddError appends a new error message to our ErrorList. If more than ten errors have been found there’s really no point in continuing. It prints the errors and forces the parser to exit with an error code.

Syntax Analysis


If it isn't clear, our parsing stage does syntax analysis as it works. It ensures that the source code we’re parsing adheres to our grammar rules.

Anything less is unacceptable. No, seriously.

Finished


That’s it for this stage. If you got through this then the final bits should be easy. The parser will get more and more complicated as our grammar gets more in-depth. Calc 2 ought to serve as a good example of that.

Compiler Part 8 - Abstract Syntax Tree

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens
Part 7: Scanning

Before we can build the parser we need to take a detour and talk about how we’re going to handle the data we’re going to process.

We need to use some kind of abstract data type to hold all the data we’re parsing. As it turns out, a tree data structure is perfect for our needs. This tree describes the syntax of our programming language and is appropriately called the Abstract Syntax Tree (AST).

AST


Tree data structures always start with a root and ours is no different. Typically, in a full-fledged compiler, you’d probably have an object which represents a package or program. In our case, we only have one file so we’ll have an object called File.

The rest of the objects we need to create can be found in our grammar blueprint. I refer you, again, back to Part 5 where we created our grammar rules. This handy layout tells us exactly what we need. Aren’t you glad we created it? It’s also going to come in very handy when we start parsing and syntax analysis.

The other objects we’ll need to create: an expression and an integer.

Go Interfaces


Before we create our objects we need to consider what they’re going to look like. What is it we know?

We know that our scanner is going to provide us with three things:

  • The literal string of the element we've scanned. A number might be “42”.
  • The token representing the element. An opening parenthesis would be token.LPAREN.
  • The position at which the element starts. A number between the file’s base and base + size.

One thing that will be common to all objects in our tree is their position. Where do they start and where do they end? Error reporting, for example, won’t care what kind of object it has. It only cares what its position is. So, we create a Node interface which has two methods: Pos, for the start position, and End, for the end position.

Next, there is a recurring element in our grammar rules: expressions. An expression can either be an integer or a sequence of elements encapsulated in parentheses. The interface will be called an Expr. Any Expr objects must implement the method exprNode as well as fulfill the Node interface.
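
A sketch of the two interfaces; the Pos type is a stand-in for the token package's position:

package ast

type Pos int

// Node is implemented by everything in the tree: error reporting only needs
// to know where a node starts and where it ends.
type Node interface {
	Pos() Pos // position of the first character of the node
	End() Pos // position immediately after the node
}

// Expr marks the subset of nodes that are expressions. The unexported
// exprNode method keeps other packages from implementing it accidentally.
type Expr interface {
	Node
	exprNode()
}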

We can now pass our objects around between different parts of our compiler based on their interface and use type assertions to discover their exact type, if required. When we get to parsing, this ought to become clearer.

Objects


There are three objects we have to implement, as well as one extra helper object:

  1. Integer
  2. Expression
  3. File

We’ll start with the Integer. We have to be a bit forward thinking with our objects and the integer type is no different. We could spend a lot of time creating an object for each type in our language. We could, however, just create a single object to hold any basic type.

BasicLit is just such an object. Whether we have an integer, float, string, or character it doesn't matter. This object satisfies our needs for all of them. Of course, we only have integers at the moment but eventually we want to add more. This object holds the token telling us what type of object it is, the position where the literal starts and the string representing the literal.

Skipping over the Expression for a moment, we’ll move on to File. This object is even simpler. This is where it all starts. It’s our starting point and has one field, Root. Looking back at our grammar rules we can see that a file holds an expression and that expression is the root of everything else.

Our expressions are of a specific type right now. They’re binary expressions. Mathematics. We have several elements we have to keep track of: An operator, opening and closing parentheses and two or more operands which are, in of themselves, expressions.

A binary expression won’t always be our only type of expression. Eventually, we’ll want more. One thing that will be common to every expression in our language is an opening and closing paren. That’s where the generic Expression object comes in handy. Its only job is to handle the parens. This object will be nested into all our sub-expressions.

The beauty of Go is that any object with an embedded type sort of “inherits” the nested object’s methods. It’s not true inheritance but rather the compiler being clever. Regardless, the result is the same for our purposes.

As already stated, the BinaryExpr must have an operator and a list of operands. We have fields that hold the operator and where it is located in the source. Our operands are held in the field List, which is a slice of objects that fulfill the Expr interface.
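
Putting those pieces together, and continuing the interface sketch above (same Pos and Expr types), the concrete objects might be sketched like this; field names are illustrative rather than Calc 1’s exact declarations:

// BasicLit covers any basic literal: only integers for now, but eventually
// floats, strings and so on. The token Kind says which one it is.
type Token int

type BasicLit struct {
	LitPos Pos
	Kind   Token
	Lit    string
}

func (b *BasicLit) Pos() Pos { return b.LitPos }
func (b *BasicLit) End() Pos { return b.LitPos + Pos(len(b.Lit)) }
func (*BasicLit) exprNode()  {}

// Expression handles the one thing every expression shares: the parens.
// It is embedded into concrete expressions such as BinaryExpr.
type Expression struct {
	Opening Pos
	Closing Pos
}

func (e *Expression) Pos() Pos { return e.Opening }
func (e *Expression) End() Pos { return e.Closing + 1 }

// BinaryExpr is an operator plus two or more operands; nested expressions
// give the tree its recursive shape.
type BinaryExpr struct {
	Expression
	OpPos Pos
	Op    Token
	List  []Expr // operands: integers or further expressions
}

func (*BinaryExpr) exprNode() {}

// File is the root of the tree and holds the single top-level expression.
type File struct {
	Root Expr
}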

AST Utilities


If we need to inspect our AST, a method of walking the tree would be handy. The simple Walk function does just that.

Printing the AST is done via the Print method which merely Walks the tree and prints out some information about each node in order.

Moving On


We’ll finally get to parsing soon but I want to be sure the code for the AST is clear. There’s that nasty recursive definition of an Expression and likely some methods that don’t provide any context to their use. Make sure you review this article and any previous posts before moving on.

Thankfully, parsing ought to make it clearer how all the pieces fit together.

Compiler Part 7 - Scanning

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification
Part 6: Tokens

Now, we can finally work on the scanner.

Lexical Analysis


So, where do we start?

That was the hard part. Scanning, to me, seemed like it should be easy but I quickly got lost in the details. There are many ways to implement a scanner and I’ll show you how to do one type. Rob Pike did a great presentation on another cool method in a talk called: Lexical Scanning in Go.

The basic concept of this scanner is that we start at the top, moving left to right, until we reach the end of the source code. Each time we find an element of interest we’ll report the literal string found, a token telling the parser what it is and the position at which it was found.

Finite State Machine

Now, I’m not going to go into any real detail about things like a Finite State Machine (Finite Automata) or anything. You should investigate this yourself. Coursera has a course on compiler design you may find of help, too, that covers this topic. The concepts are important but it’s not strictly necessary to know every little detail (though I do encourage you to learn them).

The basic idea here is that we have a finite number of states which our scanner can return. These states are represented by tokens and since we can only return the limited tokens we’ve defined, we can say that our scanner has finite state. Hence, finite state machine. Understanding automata comes in handy when understanding regular expressions and/or when accepting or rejecting an individual character being scanned.

This will all become clear shortly.

Oops


I’d like to make very clear one of the mistakes I made on my first attempt at writing a scanner. Writing any part of the compiler without some kind of solid definition for your language is a terrible idea. If the core design of your language is still fluid then you’re going to be rewriting your compiler a lot. And I mean, a lot. I had to rewrite my first attempt at an interpreter several times, pretty much from scratch, each time I revised my language. A complete waste of time.

It was this process that finally made me realize how bad decisions are made. What at first seems like a good idea might turn out to be a poor idea later but not making a decision at all will end up being disastrous. Many times I've criticized a language’s design asking myself, “Why in heck would you have done THAT? It’s so stupid!” Hindsight is 20/20 my friends.

The Scanner

The scanner is fairly simple. We start with a simple object which tracks things like the current character scanned, the offset from the beginning of the file, something we’ll call the reading offset, the source code being scanned and a pointer to the details about the file itself.

The first step to scanning is to initialize the scanner via Init. There’s nothing very special here outside the call to the next method, which I call “priming the pump.”

The next method is a funny little function. It first resets the current character to zero to indicate the end of the file. If the reading offset is less than the length of the file then the offset is changed to the reading offset. If a newline is encountered we make note of its location in the file object. Note that we discard the newline but still note its location. Finally, we update the current character and increment the reading offset.
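
A sketch of the scanner's state and its next method, simplified to plain ASCII; the field names are illustrative:

package scan

// file is a stand-in for the token package's File type; AddLine records the
// offset of a newline for later line/column lookups.
type file struct{ lines []int }

func (f *file) AddLine(offset int) { f.lines = append(f.lines, offset) }

// Scanner matches the description above: current character, its offset, the
// read-ahead offset, the source text and the file details.
type Scanner struct {
	ch      byte // current character; 0 means end of file
	offset  int  // offset of ch within src
	roffset int  // reading offset: where the next character starts
	src     string
	file    *file
}

// next advances the scanner by one character.
func (s *Scanner) next() {
	s.ch = 0 // assume end of file until proven otherwise
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		if s.src[s.offset] == '\n' {
			s.file.AddLine(s.offset) // remember where the line ends
		}
		s.ch = s.src[s.offset]
		s.roffset++
	}
}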

Reading Offset and Unicode

What’s the deal with the reading offset? It mainly has to do with Unicode. A single character may take one or more bytes so you can’t just increment the offset by one each time. The utf8 package’s DecodeRune function returns the number of bytes in the next character. The reading offset is used in that situation to mark the start of the next rune to read.

While this scanner won’t be Unicode friendly we can still start incorporating some of the functions we’ll end up needing so that we’ll have less work to do later when we add it. We’ll also be using the IsDigit and IsSpace functions from the unicode package.

Scan


This puppy is the meat and potatoes of the scanner. The Scan method starts by skipping any whitespace. The method skipWhitespace simply advances the scanner one character at a time until it reaches a non-whitespace character. I've utilized the unicode.IsSpace function to achieve this goal.

Next, it looks for multi-character elements. In this case, we are just looking for numbers. After that, we look for single character elements and finally wrap things up by reporting either an illegal character or the end of the file.

After each pass we always need to increment the scanner to the next character via a call to next and return the results of the scan.

We should also have our handy language spec at our sides. It tells us exactly what to do. Head back to Part 5 if you need to find it again.

Integers


If we encounter a digit, we scan for a longer sequence via scanNumber.

I've chosen to use the unicode.IsDigit function but we could have just as easily written our own implementation. Something as simple as: return s.ch >= '0' && s.ch <= '9' would have sufficed. scanNumber works by continually advancing the scanner until a non-digit is found. The if statement after the loop handles the instance that a number occurs at the end of a file.
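
Continuing the scanner sketch from above (and assuming the unicode package is imported), scanNumber might look like this; the end-of-file check mirrors the if statement just mentioned:

// scanNumber advances while the current character is a digit and returns
// the literal string of the number.
func (s *Scanner) scanNumber() string {
	start := s.offset
	for unicode.IsDigit(rune(s.ch)) {
		s.next()
	}
	if s.ch == 0 && s.offset < len(s.src) {
		// the number ran to the end of the file; include the final digit
		return s.src[start:]
	}
	return s.src[start:s.offset]
}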

In a later version of Calc, or on your own, I will expand this function to include various other numerical forms like floating point and hexadecimal.

More Scan


If a digit was not found, we move on and check for single characters. These should all be pretty self-explanatory with maybe the exception of the semi-colon and default.

Comments start at a semi-colon and go to the end of the line or end of the file. In the current incarnation of the scanner comments are discarded but it would be just as simple to keep them. For instance, the Go scanner reports any comments it finds and passes them along to the parser so that the parser can create those wonderful Go docs we all know and love!

Wrap Up


That’s all there is to the scanner. It’s fairly straightforward.

On to parsing!

[Edit: Add link to Rob Pike's talk on Lexical Scanning in Go]

Compiler Part 6: Tokens

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design
Part 5: Calc 1 Language Specification

In this post we can finally start to lay down some code!

Tokens


Previously, we discussed our grammar and the set of tokens we’ll need to scan for. We defined an expression, a number and operators. We also identified we expect to encounter some parentheses. We should also let our parser know when the scanner has reached the end of the file.

Before we can start scanning we need to formalize our code for the tokens our scanner will be emitting. Our tokens are going to be used in almost all stages of our compiler. If we want to create tools like Go’s fmt or vet programs we might need to reuse these tokens.

Here’s the first section of code: https://github.com/rthornton128/calc/blob/calc1/token/token.go

The constants might look a bit funny at first. There are some lower-case, non-exported tokens intermixed with upper-case, exported ones. The non-exported tokens will help us with writing some utility functions and allow the language to expand without modifying any other code.

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L36

Next comes a mapping of Tokens to strings. I could have used an array of strings instead, but a map makes writing the Lookup function easy.

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L50

The rest are some utility functions. You can see with IsLiteral and IsOperator where our non-exported constants come in handy. No matter how many new operators or literals we add we never have to adjust these functions. Handy!

https://github.com/rthornton128/calc/blob/calc1/token/token.go#L58

Lookup, String and Valid help us when generating error messages.

Positions


This file might take a moment to wrap your head around. I’ll try and take it slow.

When scanning, we start at the first character in the stream, starting at the top and working down, moving left to right. The first character is at offset zero.

By comparison, when reporting an error a user wants to know the line and column at which the error occurs. The first character would be line one, column one. We therefore have to translate the position of the character offset into something meaningful to the end user.

A position, Pos, is the character offset plus the file’s base. With a base of one, a character at offset zero has a Pos of one.

It is illegal to have a position of zero because it would be outside of the file. Consequently, it would also be illegal to have a position greater than the base of the file plus its length.
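
Here is a sketch of how a Pos can be turned back into a line and column for error messages; this is my own simplified take, not Calc's exact implementation:

package token

import "fmt"

// Pos is a character offset plus the file's base; zero is invalid because
// it would fall before the start of the first file.
type Pos int

// File records its base, size and the offset of every newline encountered.
type File struct {
	name  string
	base  int
	size  int
	lines []int // offsets of newline characters, in order
}

// Position is the human-friendly form used in error messages.
type Position struct {
	Filename  string
	Line, Col int // both 1-based
}

func (p Position) String() string {
	return fmt.Sprintf("%s:%d:%d", p.Filename, p.Line, p.Col)
}

// Position translates a Pos back into a line and column by counting the
// newlines that come before it.
func (f *File) Position(p Pos) Position {
	offset := int(p) - f.base
	line, col := 1, offset+1
	for _, nl := range f.lines {
		if nl >= offset {
			break
		}
		line++
		col = offset - nl // the column restarts after each newline
	}
	return Position{Filename: f.name, Line: line, Col: col}
}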

Why something so complicated you might be wondering? Well, once we need to parse multiple files it can be quite confusing to identify in which file an error occurred without a lot of overhead. The Pos makes it easy. More on this in the next series.

The Position type is strictly for error reporting. It allows us to output clear information about which line and column the error happened at and in which file it occurred. At this stage we’ll only ever be dealing with a single file but we’ll be grateful for this code later.

Files


File is, strictly speaking, completely unnecessary for writing a compiler but I feel that good, clear errors messages are vitally important. Go does a pretty good job but many compilers don’t. The GNU C compiler, for instance, used to be horrid. It has improved a lot over the years.

It provides a framework for something yet to come. This code is really just laying the groundwork for when we have to process more than a single file.

Its sole purpose is essentially error reporting and it goes hand-in-hand with the position code. Again, since we only have one file the base, or start, position is one. It can’t ever be less but in the future it could be higher. Don’t worry about that now, though.

Each time the scanner detects a newline character, we want to add that position to the list of newlines in the file. This allows the Position function to accurately calculate where an error occurred and to report its position.

Summary


That about covers the token section. As promised, I've not talked much about the support code.

I suggest referring back to this code often. Once you begin to understand how it all works together it will make a lot more sense. This library gets used pretty liberally throughout the compiler so we'll be coming back to it often.

Compiler Part 5: Language Specification

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design

It's finally time to cover the design specification for Calc!

Defining the Language


I want to keep the language as simple as possible. I called the language Calc. You know, for calculator. Clever right? Right? Ok...uhm, moving on!

I also want a single, basic type. I decided to go with something equally as clever as the language’s name and called it an Integer. Brilliant, I know. Your admiration is duly noted. I decided to avoid dealing with floating point numbers for simplicity's sake as well as octal, binary, hexadecimal, and scientific notations. I leave it as an exercise for you to add these other notations in.

We’ll also need to specify the end of a file, a comment and mathematical operators.

It will also be essential to know operator precedence and encapsulation when we start to parse and evaluate our language. Call it a blessing or a curse but I decided at this time that I didn't want to mess with operator precedence. So, I've decided to use prefix notation and Lisp style expressions. This means that the operator will precede the arguments and be encapsulated in a left and right parenthesis. An example of adding two and three would look like:

Example 1 (simple expression)

(+ 2 3)
Result: 5

This has two advantages: 1) operator precedence becomes explicit; and, 2) we can perform an operation on an arbitrary number of values.

Example 2 (operator precedence):

(* (+ 3 1) (/ 4 2))
Result: 8


There’s no question here what operations are performed in which order. The addition and division operators are calculated independently, prior to the multiplication. No BEDMAS for you!

Example 3 (multiple values):

(* 1 2 3 4 5)
Result: 120


In this case, we process each element of the expression, moving from left to right. So we could rewrite this expression in the following equivalent forms:

(* 5 (* 4 (* 3 (* 2 1)))) == ((((1*2)*3)*4)*5) == 1*2*3*4*5

So, we need to define tokens for the left and right (open and closing) parentheses, and each mathematical operation we want to perform.

Last, we ought to throw in the ability to make comments. We’ll stick to single line comments but multi-line isn't that hard to add. Let’s use a single semi-colon to denote a comment.

The Tokens We Need

  • Left Parenthesis 
  • Right Parenthesis 
  • Addition Operator 
  • Subtraction Operator 
  • Multiplication Operator 
  • Division Operator (Quotient) 
  • Modulus Operator (Remainder) 
  • Integer 
  • Semi-Colon (Comment)
That’s everything we need to start lexing. The diligent will note that I have not defined tokens for any keywords, built-in functions or variables. That’s because our simple language doesn’t have any.

The point of this initial series isn’t to teach you to design a Turing complete, fully functional programming language. The sole purpose is to give you the basic techniques to write an initial, working compiler.

I am currently writing a follow up series discussing how to add these features.

Whitespace


The next thing we have to decide on is how to handle whitespace. In languages like Lisp or Go, whitespace is mainly ignored. I say mainly because it’s not completely ignored. Strings and comments don’t ignore whitespace. It is also used to separate elements like numbers. However, it is ignored in the sense that it is not important to the syntax or semantics of the language.

Whitespace, incidentally, is typically a space, tab or newline. On Windows, you’ll want to toss carriage returns in with that lot, too. Basically, any non-visible characters you can type on a standard keyboard.

Now, your language could take whitespace into consideration. In Python, for example, it is very important, where indentation indicates statement blocks. Calc will not.

Syntax


We've outlined the elements our code will contain. Now we need to provide some meaning to it.

We are largely concerned with expressions in our language. We make no statements (terrible joke, I hope at least one person gets it). We want to express mathematical equations.

Each expression must be encapsulated in brackets, begin with an operator and must contain two or more operands. An example:

“(“ operator operand1 operand2 [… [operandN]] “)”

I've put the parentheses in quotations to signify they are a literal, required element which do not have extended grammar rules. The operator and first two operands are also required elements. The square brackets signify the additional operands are optional.

What can our operators be? They may only be one of the following symbols: + - * / %

How about our operands? Well, now things get a little trickier but I think we can handle it. Our operands can either be a number or another expression. This allows us to nest expressions for making complex calculations.

I’d like to express our syntax in another form: a notation for defining something called a context-free grammar. This particular one is a variant on Backus-Naur Form (BNF) with some regular expressions thrown in. Using this notation, we can express our language like this:

file ::= expr
expr ::= "(" operator whitespace expr (whitespace expr)+ ")" | integer
operator ::= [+-*/%]
whitespace ::= [ \t\n\r]+
integer ::= digit+
digit ::= [0-9]


I am no expert on context free grammars so hopefully I haven’t cocked it up. As you can see, it represents the requirements listed above quite succinctly and will make designing our parser much easier than without it.

Final Notes


Our language is quite simple but the grammar helps to make it clear what we expect and will make writing our scanner and parser much easier.

Looking at the integer definition, it’s quite clear what we need to scan for. We will need to find one or more digits between zero and nine. The same is true of white space where we need one or more spaces, tabs or newlines.

When parsing we know exactly what to expect. Our program consists of a single expression. Each expression is enclosed in brackets and contains exactly one operator which must be the second element of the expression. The operands may either be an expression or a number. We must have at least two operands.

Compiler Part 4: Language Design

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling

In part one we had a quick introduction to what this series is going to be about. In parts two and three I gave a very brief, non-exhaustive overview of the steps involved in compiling a computer programming language.

In this post, I’m going to be a bit more specific rather than the broad overviews in the last few articles. I’m going to discuss the language specification for our language.

Language Design Overview


Did I not just finish saying I was going to stop with the overly broad overviews? Turns out, I’m a dirty, rotten liar.

I’ll try and be as brief as possible so we can get into laying down some code but I want it to be clear that this is a very incomplete view of language design. Computer language design is a topic you can devote your life to. In no way am I going to give you any theory behind the design choices in Calc. I’m not going to discuss much at all.

In fact, I’m basically going to just tell you what the language specification is.

I want to be completely frank that Calc 1 isn't even really a programming language. Its sole purpose is to teach the basics of building a compiler and act as a spring-board for future specifications of Calc to be covered in a future blog series.

Language design is hard. Very hard. It’s easy to say: “I want this feature, that feature and those features.” Why? For what purpose? How will it complicate the design? Could it confuse users? Is it necessary? Who is your target audience and would they want or need the feature?

Loops


These are not questions that are easy to answer even with tremendous amounts of experience. Some features, like looping constructs, might appear like a no-brainer but are they? Let’s take the two most common loop constructs: for and while.

Almost every computer language ever designed has some method of looping, whether it’s a jump like a goto statement or the very common for and while loops. Recursion is another form of looping. The most popular tend to be for and while.

First question, first: Why?

Well, we want to repeat an action. OK, pretty easy.

For and while both require an exit condition to complete. We usually associate these conditions with the logical true and false. Both have slightly different signatures:

while(condition) do { action }

for (start; condition; increment) do { action }


Why? Well, originally, I think it was logical to think of these loops as different. The Go developers, however, saw some commonality between the two constructs and asked themselves this very question: "Why?"

Why have two keywords, two reserved names, to essentially do the same thing? It might be harder to parse but wouldn’t it be nice for the end user if they just had to use one looping construct and only a single, reserved keyword instead of two?

The Go developers chose the shorter for keyword. I think it’s a more intuitive word which likely had bearing on their decision, too. ‘for (condition) do { action }’ makes sense. Suddenly, our language is simpler and cleaner by challenging a simple premise we’ve long taken for granted. Why use two keywords when one will suffice?

Trade Offs


There’s always a trade off. Complexity for simplicity. Speed for convenience.

The cost of the for loop decision was some mild complexity in parsing. The developers made a conscious choice to make parsing the for statement harder in order to provide a vast improvement in productivity for the programmer.

Generics is a hot topic. Go doesn’t provide any. Why?

Generics are all about trade offs and I can’t even begin to explain them to you. Your best bet is to do some searches in the golang mailing list and look at the FAQ.

The end result is that you need to decide what you want. Do you want a lean and mean language or a complex one that does everything under the sun? Why did you make the choices you did?

Just because a feature is in every other language doesn’t mean it needs to be. The while loop is a perfect example of that.

A switch, in Go, is so much more than it is in languages like C. It’s a simple concept made vastly more powerful without adding much cognitive complexity for the programmer.
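For example (again, plain Go, not Calc), a Go switch can drop its expression entirely and match on arbitrary boolean cases, with no fall-through by default:

```go
package main

import "fmt"

func describe(n int) string {
	// An expression-less switch reads like a cleaner if/else-if chain.
	switch {
	case n < 0:
		return "negative"
	case n == 0:
		return "zero"
	default:
		return "positive"
	}
}

func main() {
	fmt.Println(describe(-4), describe(0), describe(7))
}
```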

Type System


Dynamic or Static? Strong or Weak? Explicit or implicit?

I, personally, prefer a strong, statically typed language. I like to be sure that what I expect is what I get without having to do anything more.

Can, or should, a type be castable? Can a floating point number be an integer? Could a character be a float? What is a string?

Bounds checking on arrays? Slices?

Pointers and addressable variables? Variables passed by reference or by value?

Like I said, something seemingly simple becomes very, very complex quickly.

Summary


I have new-found respect for many languages. I’ll be honest. I don’t like Java. I recognise its strengths but writing code in the language feels like programming with a donkey. That said, I have to respect the designers. I have to respect what they’ve done and achieved. I have to respect that what was perhaps the best, or only, decision back then may be a bad idea now. Java does a job and millions of programmers use it. You have to respect that.

You will make bad design choices, no matter how clever you think you are. I’ve followed the Go mailing lists since early 2011 when I first became aware of the language and I’ve seen how the language evolved until it was finalized in its 1.0 release. There have been, and continues to be, many arguments over the design of the language. Language design is HARD and your decisions won’t please everyone.

As I said earlier, language design is a huge, huge topic. Many languages, I think, fall victim to wanting to please everyone and anyone. They have toolboxes full to overflowing. Other languages draw a line and refuse to cross it, even in the face of overwhelming evidence of its necessity. Both types of governance piss people off.

You need to choose which camp you want to be in and why.

Next up: The Calc1 Specification

Compiler Part 3 - Compiler Design Overview

Part 1: Introduction
Part 2: Compiling, Transpiling and Interpreting

What Are We Writing?


A calculator. Sort of. A compiler for a super simple mathematical language.

We are, at least for now, going to avoid the complications of dealing with strings and characters and focus on numbers. Not just numbers, whole numbers. Integers to be specific. This is the only “type” we’ll have in our system, for now.

Adding new types isn't exactly hard, and is indeed desirable, but it makes our design much more complicated than it needs to be at this stage in the game.

Stages of Compilation


Whether you are building an interpreter or a compiler, most of the steps remain the same. The most common, basic steps are:
  1. Lexical Analysis
  2. Parsing
  3. Semantic Analysis
  4. Optimization
  5. Code Generation
We’ll attack each of these steps one at a time.

Overview: Lexical Analysis


We've all written code before. It’s a “human readable” representation of what we want our programs to do. We need to take something that is human readable and make it something a computer can understand. The first step is lexical analysis.

The short of it is, we scan through the text and report what we find. That is, we pick out each of the distinct parts of our language and assign it an identifier called a token. Each token is associated with a lexeme, the literal string of characters it was scanned from, for handling in the next stage. Literals (numbers, strings, etc.), keywords and operators are examples of the kinds of things we identify in this step.

For error reporting, we should also provide the position at which the token was found.
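As a rough sketch of the idea in Go (the names here are hypothetical, not Calc's actual types), a token needs little more than its kind, its lexeme and its position:

```go
// Package token: an illustrative sketch only.
package token

// Kind identifies what sort of lexeme was found.
type Kind int

const (
	INTEGER Kind = iota // e.g. "42"
	OPERATOR            // e.g. "+"
	LPAREN              // "("
	RPAREN              // ")"
)

// Pos records where in the source the token starts, for error reporting.
type Pos struct {
	Line, Column int
}

// Token pairs a kind with the literal text it was scanned from.
type Token struct {
	Kind   Kind
	Lexeme string
	Pos    Pos
}
```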

Overview: Parsing


In this stage, also called syntactic analysis, we start to give meaning to the tokens we’ve found. Each token is represented by an object and placed into a tree data structure.

The language syntax is verified at this stage. We want to ensure that the order of what we’ve received is correct. In LISP, an expression takes the form of an opening bracket, an operator or function, followed by any number of arguments and a closing bracket. Parsing ensures that these requirements are met.
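For a LISP-style expression like (+ 1 2 3), a hypothetical tree representation in Go might look something like this (illustrative names only, not Calc's actual AST):

```go
// Package ast: an illustrative sketch only.
package ast

// Expr is any node that can appear as an operand.
type Expr interface{}

// BasicLit is a literal value, e.g. the integer 42.
type BasicLit struct {
	Value string
}

// OpExpr is an operator applied to any number of arguments, e.g. (+ 1 2 3).
type OpExpr struct {
	Op   string
	Args []Expr
}
```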

Overview: Semantic Analysis


Next, we check to ensure the semantics of our language are met. If a variable has been declared as an integer but assigned a string, we have a problem. The number of arguments of a function call not matching the number of parameters of the function signature is another semantic error.

Semantic analysis is also concerned with things like variable scope and generates errors if a variable is used that hasn't been declared.
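A hypothetical version of the argument-count check might look like this in Go (again, illustrative names, not Calc's actual code):

```go
package main

import "fmt"

// Expr stands in for any parsed expression; illustrative only.
type Expr interface{}

// FuncSig records a declared function's name and parameter list.
type FuncSig struct {
	Name   string
	Params []string
}

// checkCall reports a semantic error when a call's argument count
// does not match the declared parameter count.
func checkCall(sig FuncSig, args []Expr) error {
	if len(args) != len(sig.Params) {
		return fmt.Errorf("%s expects %d argument(s), got %d",
			sig.Name, len(sig.Params), len(args))
	}
	return nil
}

func main() {
	add := FuncSig{Name: "add", Params: []string{"a", "b"}}
	fmt.Println(checkCall(add, []Expr{1})) // add expects 2 argument(s), got 1
}
```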

Overview: Optimization


This stage does what it says. It optimizes the output, usually for speed or size (memory consumption). I won’t be covering much of this in this series but an easy example would be assigning a variable the result of an expression rather than the expression itself.

For example, what if we assign the variable A the expression ‘2 + 3’? Adding two constant numbers could require several steps in code generation. This can be optimized by pre-calculating the result and assigning it in the generated code, simplifying the statement to ‘A = 5’.
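Here is a sketch of that kind of constant folding, reusing the hypothetical AST shapes from the parsing overview (not Calc's actual optimizer):

```go
package main

import (
	"fmt"
	"strconv"
)

// Expr, BasicLit and OpExpr mirror the hypothetical AST sketch above.
type Expr interface{}

type BasicLit struct{ Value string }

type OpExpr struct {
	Op   string
	Args []Expr
}

// fold pre-computes an addition whose operands are all integer literals,
// replacing the whole expression with a single literal.
func fold(e Expr) Expr {
	op, ok := e.(*OpExpr)
	if !ok || op.Op != "+" {
		return e
	}
	sum := 0
	for _, arg := range op.Args {
		lit, ok := arg.(*BasicLit)
		if !ok {
			return e // not all operands are constants; leave it alone
		}
		n, err := strconv.Atoi(lit.Value)
		if err != nil {
			return e
		}
		sum += n
	}
	return &BasicLit{Value: strconv.Itoa(sum)}
}

func main() {
	e := &OpExpr{Op: "+", Args: []Expr{&BasicLit{"2"}, &BasicLit{"3"}}}
	fmt.Printf("%+v\n", fold(e)) // &{Value:5}
}
```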

Overview: Code Generation


This is the final stage. It outputs the lower level code we want to emit. A Java compiler would output Java bytecode. A C compiler might output assembly. An assembler outputs machine code.
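As a toy illustration of the idea (not Calc's actual code generator), a compiler that targets C as its output could simply print C source, which the system C compiler then turns into a binary:

```go
package main

import "fmt"

// emitC wraps a pre-computed value, such as the folded '5' from the
// optimization example, in a minimal C program.
func emitC(value string) string {
	return fmt.Sprintf(
		"#include <stdio.h>\n\nint main(void) {\n\tint A = %s;\n\tprintf(\"%%d\\n\", A);\n\treturn 0;\n}\n",
		value)
}

func main() {
	fmt.Print(emitC("5"))
}
```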

Moving On...


Bear in mind that there can be more or fewer steps in your compiler. A C compiler, for example, also has a pre-processing step. Some steps, like parsing and semantic analysis, can be wrapped together into a single step. There is a lot to compiler design. For a broader overview, check out this Wikipedia page: Compilers.

It would be great if we could dive into writing a compiler now. Unfortunately, that would be folly. Without knowing what our language requires we’d be at a loss as to what to actually write. Without a set of rules to follow we’d have no direction.

Next up: the language specification!

Compiler Part 2: Compiling, Transpiling and Interpreting

Part 1 served as an introduction to this series.

In this second post, I want to do an overview of some definitions before we dive in to the actual steps of compiling.


Compiling


Compiling is the act of taking code written in one language and translating it into a lower level language. A C compiler doesn't usually output machine code directly. Instead, it translates the C source code into assembly. The assembler takes that output and compiles it into machine code. C# and Java are translated into bytecode. Bytecode isn't transformed into machine code until it is executed by the virtual machine.

It’s important to understand this distinction.

Compiling often consists of using an intermediate representation (IR) or intermediate language. Assembly is a common intermediate language. LLVM has an IR imaginatively called LLVM IR. C can also act as an intermediate language.


Transpiling


Transpiling, by contrast, is transforming code from one language to another language of equal or higher level. Translating Go to Javascript is an example of this.

Do not confuse this with what languages like Scala or Clojure do. Both of these languages are compiled directly to Java bytecode and use the JVM to execute their code. Neither is considered transpiled because Java bytecode is a lower level of abstraction. Were they translated into Java first, prior to being compiled to bytecode, then they might be considered transpiled languages.

Translating Go or Lisp to C isn't really transpiling either, though it rides a fine line. C is a lower level language although it is certainly a higher level language than assembly.


Interpreting


Interpreters are usually associated with scripting. Rather than translating the source into another language or an IR, the language is typically interpreted directly.

In some cases this means, quite literally, scanning the code and interpreting it on the fly and halting the moment something isn't right. In other cases, it means scanning through the entire code, validating it and then executing it from the internal tree data structure holding all the information.

In a sense, an interpreter of this kind must “compile” the script or source code each time it is run, making it much slower. This is because it needs to do all the steps a compiler does each time the script is executed rather than doing that process only once.

Many modern interpreters don’t typically do this, however. You may want to investigate Just In Time (JIT) compilation if you’re interested in the details. The short and simple version is that the script is translated only once, much as a compiler would do, and stored in some kind of intermediate form so it can be loaded faster. This process also lets the designers incorporate optimizations to further speed up the code.

Binaries, Compiling


The objective of compilers is often to create an executable binary, or at least some kind of object the machine can read and execute.

It’s not so simple to just create machine code. Compiling is actually just the first step.

Here are the basic steps that GCC and Go might use to compile their source code:

C using GCC:


  1. Translate C into assembly via GNU C Compiler (gcc);
  2. Translate assembly into machine code via GNU Assembler (gas);
  3. Link machine code with standard library and produce an architecture specific binary via the GNU Linker (ld).


Go using Go Compiler on x86-32:


  1. Translate Go into assembly via Go compiler (8g);
  2. Translate assembly into machine code via Plan 9 Assembler (8a);
  3. Link machine code with standard library and produce an architecture specific binary via the Plan 9 Linker (8l).

As you can see, the compiler is just one step of the process. In these articles we’ll only be dealing with step 1. Assembling and linking are well beyond the purpose of these articles.

Don’t fret though! Our compiler will produce an executable binary!


Next


Now that we've taken care of some definitions and understand the process to making executable code we can look at a compiler in a little more detail in Part 3.