Saturday, 3 May 2014

Compiler Part 3 - Compiler Design Overview

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters

What Are We Writing?

A calculator. Sort of. A compiler for a super simple mathematical language.

We are, at least for now, going to avoid the complications of dealing with strings and characters and focus on numbers. Not just numbers, whole numbers. Integers to be specific. This is the only “type” we’ll have in our system, for now.

Adding new types isn't exactly hard, and indeed is desirable, but it would make our design much more complicated than it needs to be at this stage of the game.

Stages of Compilation

Whether you are building an interpreter or a compiler, most of the steps remain the same. The most common, basic steps are:
  1. Lexical Analysis
  2. Parsing
  3. Semantic Analysis
  4. Optimization
  5. Code Generation
We’ll attack each of these steps one at a time.

Overview: Lexical Analysis

We've all written code before. It’s a “human readable” representation of what we want our programs to do. We need to take something that is human readable and turn it into something a computer can understand. The first step is lexical analysis.

The short of it is, we have to scan through the text and report what was found. That is, we have to pick out each distinct part of our language and assign it an identifier called a token. Each token is associated with a lexeme, the literal string of characters, for handling in the next stage. Literals (numbers, strings, etc.), keywords and operators are examples of the kinds of things we identify in this step.

For error reporting, we should also provide the position at which the token was found.
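To make this concrete, here is a minimal sketch of a scanner in Python. It is not the lexer we will build in this series; the token kinds ("INT", "OP", "LPAREN", "RPAREN"), the Token class and the lex function are all names made up for illustration. It assumes a language with only whole numbers, four operators and parentheses.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str    # hypothetical token kind, e.g. "INT" or "OP"
    lexeme: str  # the literal characters taken from the source
    pos: int     # 0-based offset where the lexeme starts


def lex(source):
    """Scan the source text and return the list of tokens found."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1  # consume the whole run of digits
            tokens.append(Token("INT", source[start:i], start))
        elif ch in "+-*/":
            tokens.append(Token("OP", ch, i))
            i += 1
        elif ch == "(":
            tokens.append(Token("LPAREN", ch, i))
            i += 1
        elif ch == ")":
            tokens.append(Token("RPAREN", ch, i))
            i += 1
        else:
            # The position lets us tell the user where things went wrong.
            raise SyntaxError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

Calling lex("(+ 2 34)") yields five tokens, each carrying its lexeme and the offset at which it was found.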

Overview: Parsing

In this stage, also called syntactic analysis, we start to give meaning to the tokens we’ve found. Each token is represented by an object and placed into a tree data structure.

The language syntax is verified at this stage. We want to ensure that the order of what we’ve received is correct. In LISP, an expression takes the form of an opening bracket, an operator or function, followed by any number of arguments and a closing bracket. Parsing ensures that these requirements are met.
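As a rough sketch of the LISP-style rule just described, the following Python builds a tree from a string and rejects input that breaks the rule. The tree shape (a tuple of operator and argument list), the OPERATORS set and the function names are assumptions for this example only, and the splitting on whitespace stands in for a real lexer.

```python
OPERATORS = {"+", "-", "*", "/"}  # assumed operator set for this sketch


def parse_expr(tokens, i):
    """Parse one expression starting at tokens[i]; return (tree, next index)."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[i]
    if tok.isdigit():
        return int(tok), i + 1  # a bare number is a leaf of the tree
    if tok != "(":
        raise SyntaxError(f"unexpected token {tok!r}")
    i += 1
    if i >= len(tokens) or tokens[i] not in OPERATORS:
        raise SyntaxError("expected an operator after '('")
    op = tokens[i]
    i += 1
    args = []
    while i < len(tokens) and tokens[i] != ")":
        arg, i = parse_expr(tokens, i)  # arguments may be nested expressions
        args.append(arg)
    if i >= len(tokens):
        raise SyntaxError("missing closing ')'")
    return (op, args), i + 1  # skip past the closing bracket


def parse(source):
    """Split the source into tokens and parse exactly one expression."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    tree, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return tree
```

Here parse("(+ 2 (* 3 4))") produces a tree with "+" at the root and the nested multiplication as its second argument, while parse("(+ 2 3") fails because the closing bracket is missing.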

Overview: Semantic Analysis

Next, we check to ensure the semantics of our language are met. If a variable has been declared as an integer but assigned a string, we have a problem. The number of arguments of a function call not matching the number of parameters of the function signature is another semantic error.

Semantic analysis is also concerned with things like variable scope, and generates errors if a variable is used that hasn't been declared.
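Two of these checks, undeclared variables and mismatched argument counts, can be sketched in a few lines of Python. Everything here is hypothetical: the node shapes ("var", "let", "call"), the signatures table mapping function names to parameter counts, and the error messages are invented for the example and are not part of our language.

```python
def check(node, scope, signatures, errors):
    """Walk the tree, recording semantic errors instead of stopping at the first."""
    if isinstance(node, int):
        return  # integer literals are always fine
    kind = node[0]
    if kind == "var":
        if node[1] not in scope:
            errors.append(f"undeclared variable '{node[1]}'")
    elif kind == "let":
        _, name, value, body = node
        check(value, scope, signatures, errors)
        # The new name is visible only inside the body.
        check(body, scope | {name}, signatures, errors)
    elif kind == "call":
        _, fname, args = node
        if fname not in signatures:
            errors.append(f"unknown function '{fname}'")
        elif len(args) != signatures[fname]:
            errors.append(
                f"function '{fname}' expects {signatures[fname]} "
                f"argument(s), got {len(args)}"
            )
        for arg in args:
            check(arg, scope, signatures, errors)


def analyze(tree, signatures):
    """Return the list of semantic errors found in the tree."""
    errors = []
    check(tree, set(), signatures, errors)
    return errors
```

Note that the input to analyze is a tree that has already passed parsing: the syntax is valid, but the meaning may still be wrong.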

Overview: Optimization

This stage does what it says. It optimizes the output, usually for speed or size (memory consumption). I won’t be covering much of this in this series but an easy example would be assigning a variable the result of an expression rather than the expression itself.

For example, what if we assign the variable A the expression ‘2 + 3’? Adding two constant numbers could require several steps in the generated code. This can be optimized by pre-calculating the result at compile time and assigning it directly, simplifying the statement to ‘A = 5’.
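This kind of pre-calculation is commonly known as constant folding. Below is a small sketch of it in Python, assuming a tree where every node is a tuple of operator, left and right, integers are literals and strings are variable names. That representation and the fold function are assumptions for illustration. Division is left out on purpose so the sketch does not have to deal with division by zero.

```python
import operator

# Operators that are safe to evaluate ahead of time in this sketch.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace operations on two integer literals with their result."""
    if not isinstance(node, tuple):
        return node  # an integer literal or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)  # fold the children first
    if op in FOLDABLE and isinstance(left, int) and isinstance(right, int):
        return FOLDABLE[op](left, right)
    return (op, left, right)
```

With this, fold(("=", "A", ("+", 2, 3))) gives ("=", "A", 5), while an expression involving a variable, such as ("+", "x", 1), is left alone because its value is not known until the program runs.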

Overview: Code Generation

This is the final stage. It outputs the lower level code we want to emit. A Java compiler would output Java bytecode. A C compiler might output assembly. An assembler outputs machine code.
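To give a feel for what emitting lower level code looks like, here is a sketch in Python that turns an expression tree into instructions for an imaginary stack machine. The instruction names (PUSH, ADD, SUB, MUL, DIV) and the tuple-based tree are invented for this example and do not correspond to any real target.

```python
# Hypothetical instruction names for an imaginary stack machine.
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def generate(node):
    """Return a list of instructions that evaluate the expression tree."""
    if isinstance(node, int):
        return [f"PUSH {node}"]  # a literal is simply pushed on the stack
    op, left, right = node
    # Emit code for both operands first, then the operation that combines them.
    return generate(left) + generate(right) + [MNEMONICS[op]]
```

For the tree ("+", 2, ("*", 3, 4)) this produces PUSH 2, PUSH 3, PUSH 4, MUL, ADD: the operands of each operation are on the stack by the time the operation itself runs.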

Moving On...

Bear in mind that there can be more or fewer steps in your compiler. A C compiler, for example, also has a pre-processing step. Some steps, like parsing and semantic analysis, can be wrapped together into a single step. There is a lot to compiler design. For a broader overview, check out this Wikipedia page: Compilers.

It would be great if we could dive into writing a compiler now. Unfortunately, that would be folly. Without knowing what our language requires, we’d be at a loss as to what to actually write. Without a set of rules to follow, we’d have no direction.

Next up: the language specification!