Saturday, 3 May 2014

Compiler Part 5: Language Specification

Part 1: Introduction
Part 2: Compilers, Transpilers and Interpreters
Part 3: Overview of Compiling
Part 4: Overview of Language Design

It's finally time to cover the design specification for Calc!

Defining the Language


I want to keep the language as simple as possible. I called the language Calc. You know, for calculator. Clever right? Right? Ok...uhm, moving on!

I also want a single, basic type. I decided to go with something equally as clever as the language’s name and called it an Integer. Brilliant, I know. Your admiration is duly noted. I decided to avoid dealing with floating point numbers for simplicity's sake as well as octal, binary, hexadecimal, and scientific notations. I leave it as an exercise for yourself to add these other notations in.

We’ll also need to specify the end of a file, a comment and mathematical operators.

It will also be essential to know operator precedence and encapsulation when we start to parse and evaluate our language. Call it a blessing or a curse but I decided at this time that I didn't want to mess with operator precedence. So, I've decided to use prefix notation and Lisp style expressions. This means that the operator will precede the arguments and be encapsulated in a left and right parenthesis. An example of adding two and three would look like:

Example 1 (simple expression)

(+ 2 3)
Result: 5

This has two advantages: 1) operator precedence becomes explicit; and, 2) we can perform an operation on an arbitrary number of values.

Example 2 (operator precedence):

(* (+ 3 1) (/ 4 2))
Result: 8


There’s no question here what operations are performed in which order. The addition and division operators are calculated independently, prior to the multiplication. No BEDMAS for you!

Example 3 (multiple values):

(* 1 2 3 4 5)
Result: 120


In this case, we process each element of the expression, moving from left to right. So we could rewrite this function in the equivalent following forms:

(*5(*4(*3(*2 1)))) == ((((1*2)*3)*4)*5) == 1*2*3*4*5

So, we need to define tokens for the left and right (open and closing) parentheses, and each mathematical operation we want to perform.

Last, we ought to throw in the ability to make comments. We’ll stick to single line comments but multi-line isn't that hard to add. Let’s use a single semi-colon to denote a comment.

The Tokens We Need

  • Left Parenthesis 
  • Right Parenthesis 
  • Addition Operator 
  • Subtraction Operator 
  • Multiplication Operator 
  • Division Operator (Quotient) 
  • Modulus Operator (Remainder) 
  • Integer 
  • Semi-Colon (Comment)
That’s everything we need to start lexing. The diligent will note that I have not defined tokens for any keywords, built-in functions or variables. That’s because our simple language doesn’t have any.

The point of this initial series isn’t to teach you to design a turing complete, fully functional programming language. The sole purpose is to give you the basic techniques to write an initial, working compiler.

I am currently writing a follow up series discussing how to add these features.

Whitespace


The next thing we have to decide on is how to handle whitespace. In languages like Lisp or Go, whitespace is mainly ignored. I say mainly because it’s not completely ignored. Strings and comments don’t ignore whitespace. It is also used to separate elements like numbers. However, it is ignored in the sense that is is not important to the syntax or semantics of the language.

Whitespace, incidentally, is typically a space, tab or newline. On Windows, you’ll want to toss carriage returns in with that lot, too. Basically, any non-visible characters you can type on a standard keyboard.

Now, your language could take whitespace into consideration. In Python, for example, it is very important, where indentation indicates statement blocks. Calc will not.

Syntax


We've outlined the elements our code will contain. Now we need to provide some meaning to it.

We are largely concerned with expressions in our language. We make no statements (terrible joke, I hope at least one person gets it). We want to express mathematical equations.

Each expression must be encapsulated in brackets, begin with an operator and must contain two or more operands. An example:

“(“ operator operand1 operand2 [… [operandN]] “)”

I've put the parentheses in quotations to signify they are a literal, required element which do not have extended grammar rules. The operator and first two operands are also required elements. The square brackets signify the additional operands are optional.

What can our operators be? They may only be one of the following symbols: + - * / %

How about our operands? Well, now things get a little trickier but I think we can handle it. Our operands can either be number or another expression. This allows us to nest expressions for making complex calculations.

I’d like to express our syntax in another form. It’s a grammar for defining something called a context free grammar. This particular one is a variant on Backus-Naur Form (BNF) with some regular expressions thrown in. Using this notation, we can express our language like this:

file ::= expr
expr ::= “(“ operator whitespace expr (whitespace expr)+ ”)” | integer
operator ::= [+-*/%]
whitespace ::= [ \t\n\r]+
integer ::= digit+
digit ::= [0-9]


I am no expert on context free grammars so hopefully I haven’t cocked it up. As you can see, it represents the requirements listed above quite succinctly and will make designing our parser much easier than without it.

Final Notes


Our language is quite simple but the grammar helps to make it clear what we expect and will make writing our scanner and parser much easier.

Looking at the integer definition, it’s quite clear what we need to scan for. We will need to find one or more digits between zero and nine. The same is true of white space where we need one or more spaces, tabs or newlines.

When parsing we know exactly what to expect. Our program consists of a single expression. Each expression is enclosed in brackets and contains exactly one operator which must be the second element of the expression. The operands may either be an expression or a number. We must have at least two operands.