No Effing Clue: Compiler 2 Part 3: Language Infrastructure

Previously, I talked about the new language features to be implemented in Calc 2. Now, I’d like to talk about some of the thought process behind these features and the new infrastructure that will be needed to implement them.

Functions

Functions introduce some fun new concepts.

Scope is a word we use to talk about the visibility of symbols. A symbol may represent a function or a variable. Scoping determines whether a variable is local to a single function or can also be accessed from the scopes nested inside the one where it was declared.

Lexical scoping is a term used to describe a scoping method where variables are accessible within child scopes. Most languages I've used have lexical scoping. It means that when a variable has been declared in a function, it is accessible anywhere within that function, including nested functions or if statements.

As the parser reads from top to bottom, it will place any function declarations within the top-most scope, usually called the global scope. When a function is called, the global scope can be checked to see if a corresponding symbol exists and then the function can be accessed.

Of course, then there will need to be an entry point. In most compiled languages, this is called the main function or object.
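Here is a rough sketch of that idea in Python, assuming the global scope is nothing more than a dictionary keyed by function name. The helper names (declare, resolve_call, find_entry_point) are made up for illustration and are not Calc's actual code.

```python
# Hypothetical sketch: a global scope holding function declarations.
# All names here are illustrative assumptions, not part of Calc.

global_scope = {}


def declare(name, params):
    """Record a function declaration in the global scope."""
    if name in global_scope:
        raise NameError(f"function {name!r} declared twice")
    global_scope[name] = {"kind": "func", "params": params}


def resolve_call(name):
    """Look up a called function; fail if no such symbol exists."""
    symbol = global_scope.get(name)
    if symbol is None:
        raise NameError(f"call to undeclared function {name!r}")
    return symbol


def find_entry_point():
    """Assume the entry point is a function named 'main'."""
    return resolve_call("main")


# The parser would make these calls as it reads declarations top to bottom.
declare("add", ["a", "b"])
declare("main", [])
```

Because every declaration lands in the same dictionary before any call is resolved, a function can be called from a point in the source that appears before its declaration.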

Variables and Assignment

Most everything needed for variable assignment will be handled by scoping, too.

A variable, in many languages, is merely a symbol which represents a value in memory. It has a type, a name and a value. When assigning a value to a variable we need to ensure that the type of the value matches the type of the variable and that the variable is mutable.

Mutable, from the word mutate, is the ability to change. By contrast, an immutable object is one that cannot change.

To assign a value to a variable we first search for it in the current scope. If it is not there and there is a parent scope, we move up a level and search there, repeating the process recursively until we either find the correct symbol or run out of scopes to search.
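The search and the two assignment checks can be sketched roughly like this. The Symbol and Scope classes are my own invention for the example, and types are plain strings to keep things short.

```python
# Hypothetical sketch of scoped lookup and assignment; not Calc's real code.

class Symbol:
    def __init__(self, name, type_, value, mutable=True):
        self.name = name
        self.type = type_
        self.value = value
        self.mutable = mutable


class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def declare(self, name, type_, value, mutable=True):
        self.symbols[name] = Symbol(name, type_, value, mutable)

    def lookup(self, name):
        """Search this scope, then each parent in turn."""
        if name in self.symbols:
            return self.symbols[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        return None  # ran out of scopes to search

    def assign(self, name, type_, value):
        symbol = self.lookup(name)
        if symbol is None:
            raise NameError(f"undeclared variable {name!r}")
        if not symbol.mutable:
            raise TypeError(f"{name!r} is immutable")
        if symbol.type != type_:
            raise TypeError(f"cannot assign {type_} to {symbol.type} {name!r}")
        symbol.value = value


outer = Scope()
outer.declare("x", "int", 1)
outer.declare("limit", "int", 10, mutable=False)
inner = Scope(parent=outer)
inner.assign("x", "int", 42)  # found one level up, in the parent scope
```

Assigning to "limit" from either scope raises a TypeError, since it was declared immutable.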

Looping

Recursion is fun. For most of us, our first experience with looping comes from the venerable ‘for’ and ‘while’ statements. Calc will eventually have such constructs but in Calc 2 all we have is recursion.

In the simplest terms, a recursive function is one which calls itself.
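To show how recursion can stand in for a loop, here is a small sketch. It is written in Python rather than Calc, and the function name sum_to is just an example.

```python
# Summing 1..n without a loop: the function calls itself on a smaller input.
def sum_to(n):
    if n <= 0:
        return 0  # base case: this is what stops the recursion
    return n + sum_to(n - 1)  # recursive case: shrink the problem by one
```

Without the base case the function would keep calling itself until the call stack ran out, which is the recursive equivalent of an infinite loop.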

Branching

The if statement is your typical, basic branching mechanism. A branch is like a fork in the road. Depending on a specific criterion, you choose to go down either one path or the other and they are mutually exclusive.

A branch has its own scope. You may declare a variable inside an if statement and it will be unique to that branch and will not be accessible outside of the branch. It uses normal scoping rules.
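One quick way to picture a branch scope is with Python's collections.ChainMap, where new_child() creates a nested mapping that falls back to its parent on lookup. This only models the visibility rule; the variable names are invented and this is not how Calc stores its scopes.

```python
from collections import ChainMap

# Scope of the enclosing function, holding one variable.
function_scope = ChainMap({"total": 10})

# Entering the if statement opens a child scope.
branch_scope = function_scope.new_child()
branch_scope["temp"] = 5  # declared inside the branch only

# Inside the branch, both the local and the outer variable are visible.
visible_inside = (branch_scope["temp"], branch_scope["total"])

# Outside the branch, the branch's variable does not exist.
leaked = "temp" in function_scope
```

One caveat with this model: writing to "total" through branch_scope would create a shadowing entry in the child rather than update the parent, so a real implementation would need an explicit walk up the scopes for assignment.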

Types

Even though Calc 2 will still consist of a single type, the infamous integer, I will be introducing static typing and type-checking.

Static typing is a type system where the type of a variable is fixed, either explicitly or implicitly, when the variable is declared, so it is known at compile time. What does this mean?

It means that the type of a variable is determined before a program is ever run and can be checked for type-correctness (types match or are compatible) when the program is compiled.

By contrast, a dynamic type system checks the type of a variable at runtime (while the program is running).
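As a rough illustration of a compile-time check, the sketch below walks a tiny expression tree and works out its type without ever evaluating it. The tuple-based tree and the check function are assumptions made for this example. The "bool" type is also invented, purely so there is something to mismatch against.

```python
# Hypothetical AST shapes: ("int", 3), ("var", "x"), ("add", left, right).
# check() inspects types only; it never computes a value.

def check(node, var_types):
    """Return the static type of node, or raise TypeError."""
    kind = node[0]
    if kind == "int":
        return "int"
    if kind == "var":
        name = node[1]
        if name not in var_types:
            raise TypeError(f"undeclared variable {name!r}")
        return var_types[name]
    if kind == "add":
        left = check(node[1], var_types)
        right = check(node[2], var_types)
        if left != right:
            raise TypeError(f"mismatched types {left} and {right}")
        return left
    raise TypeError(f"unknown node {kind!r}")


# Types recorded at declaration time. "bool" is made up for the demo.
var_types = {"x": "int", "flag": "bool"}
ok = ("add", ("var", "x"), ("int", 3))
bad = ("add", ("var", "x"), ("var", "flag"))
```

Calling check(bad, var_types) raises a TypeError before any program would run, which is the whole point of doing the check statically.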

As previously stated, Calc 2 still only has one type. The basic “int” type is 32 bits. It is also signed, meaning that one bit is reserved to determine whether the number is positive or negative. Be aware of these limitations if you try to calculate a number that is too large to be held in this data type.
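To put numbers on that limitation, here is a small sketch that assumes the usual two's-complement representation. What Calc 2 actually does when a result is too large is not stated here, so treat the wrap-around behaviour below as an assumption about the underlying 32-bit integer rather than a documented feature.

```python
INT_BITS = 32
INT_MIN = -(2 ** (INT_BITS - 1))   # -2147483648
INT_MAX = 2 ** (INT_BITS - 1) - 1  # 2147483647


def fits_in_int(value):
    """True if value can be held in a signed 32-bit integer."""
    return INT_MIN <= value <= INT_MAX


def wrap_to_int(value):
    """Wrap value the way two's-complement 32-bit arithmetic would."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2 ** 32 if value > INT_MAX else value
```

Under that assumption, adding 1 to the largest value wraps around to the smallest one: wrap_to_int(INT_MAX + 1) gives INT_MIN.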

Negative Numbers

From a human standpoint a negative number is pretty easy. It’s a dash before a number.

That’s the key. It’s a dash before a number. The number itself is not negative. It’s a positive number with a dash prepended to it.

The scanner is only aware that a number is a series of uninterrupted digits; so, lexically, a negative number is a bit tougher. The dash is scanned first as a separate lexical element from the number. The parser receives both of them separately.

Enter what we call a unary expression. This type of expression has only a single operand. This is in contrast to a binary expression, which has two operands. During parsing we’ll need to remember that the dash may indicate subtraction or negation.
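A minimal sketch of both halves follows, assuming a toy grammar with only numbers, "+" and "-". The token and node shapes are invented for the example and are not Calc's. The scanner emits the dash as its own token, and the parser decides from position whether it means negation or subtraction.

```python
DIGITS = "0123456789"


def scan(text):
    """Split text into tokens; a number is an unbroken run of digits."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in DIGITS:
            start = i
            while i < len(text) and text[i] in DIGITS:
                i += 1
            tokens.append(("NUM", int(text[start:i])))
        elif ch in "+-":
            tokens.append(("OP", ch))  # the dash is always its own token
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens


def parse(tokens):
    """Grammar: expr := unary (('+' | '-') unary)* ; unary := '-' unary | NUM"""
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, value = tokens[pos]
        pos += 1
        if kind == "OP" and value == "-":
            return ("neg", unary())  # dash with no left operand: negation
        if kind == "NUM":
            return ("num", value)
        raise SyntaxError(f"unexpected token {value!r}")

    node = unary()
    while pos < len(tokens):
        kind, op = tokens[pos]
        if kind != "OP":
            raise SyntaxError(f"expected operator, got {op!r}")
        pos += 1
        node = ("binary", op, node, unary())  # dash after an operand: subtraction
    return node
```

For the input "3 - -5" the first dash follows an operand and becomes a binary node, while the second has nothing to its left and becomes a unary node wrapping 5.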

Multiple Source Files

Bundling multiple source files into a single object brings some unique difficulties but is, overall, simpler than one might think.

The only real obstacle to overcome is resolving symbols which exist in other files. A function may be called that exists somewhere else or not at all. Naming collisions could also happen.
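Both problems can be caught with two passes over the files, sketched below. The input format (each file reduced to the names it defines and the names it calls), the link function and the ".calc" file names are all assumptions made for the example.

```python
def link(files):
    """files maps a file name to {"defines": [...], "calls": [...]}."""
    defined_in = {}
    errors = []

    # Pass 1: gather every definition, reporting naming collisions.
    for file_name, unit in files.items():
        for name in unit["defines"]:
            if name in defined_in:
                errors.append(
                    f"{name} defined in both {defined_in[name]} and {file_name}"
                )
            else:
                defined_in[name] = file_name

    # Pass 2: every call must resolve to a definition in some file.
    for file_name, unit in files.items():
        for name in unit["calls"]:
            if name not in defined_in:
                errors.append(f"{file_name} calls undefined function {name}")

    return defined_in, errors


# Hypothetical input: two files, one collision and one missing function.
files = {
    "a.calc": {"defines": ["main", "add"], "calls": ["add", "sub"]},
    "b.calc": {"defines": ["add"], "calls": []},
}
defined_in, errors = link(files)
```

Collecting all definitions before checking any calls means the order in which files are processed does not affect whether a call resolves.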

Next Up…

Language design decisions!

Saturday, 11 October 2014