Thursday, 16 October 2014

Compiler 2 Part 4: Language Design - Variables

Table of Contents
  1. Introduction to Apprentice Compiler Design
  2. Language Specification
  3. Infrastructure
  4. Language Design Decisions
  5. Tokens
  6. Scanner
  7. AST
  8. Parser
  9. Assembly
  10. C Runtime
  11. Using the C Runtime (Assembly Part 2)
  12. Compiler, Code Generation
  13. Front End

Previously, I discussed the new language features and corresponding infrastructure that will be needed to implement Calc 2. Now I’d like to discuss language design a bit more, focusing on variable declarations.

In the first series of articles I discussed some of the thought that goes into language design. In this second series, we’re going to talk a bit more about that.

Variables


At first glance, variables are easy. A word or letter equals another value. No problem!

Not so fast!

First, you might decide what the declaration is even going to look like so we can fit it into our grammar rules.

Before anything can be finalized, we need to start asking ourselves a bunch of questions. Here’s a small sample:
  • How is assignment going to look? 
  • How will variable types be declared, if at all? 
  • What is a valid variable name? 
  • How will I determine the value of a variable? 
  • Should I have pointers? 
  • What happens when I take the address of a variable and assign it to itself, creating a circular reference? 
  • What happens when I try to access a variable that hasn’t been declared? 
  • How should I handle variable shadowing? 
  • Strong or weak typing? 
  • Dynamic or static typing?
That’s only a quick few off the top of my head. The questions feel endless sometimes but you need to keep asking them!

I realize you might not know what questions to ask. I also understand you might not know what type systems there are, what variable shadowing is and how addressing works. I can’t possibly teach you everything there is to know in a blog series but I’ll do my best to get you on the right track.

I leave it as an exercise for you to do a search for any of these terms or concepts you don’t understand. It’s a deep, deep rabbit hole so be prepared!

So lets address the majority of these questions as they pertain to Calc 2.

Variable Names


Most languages restrict what kind of characters can be used for a variable name. Usually, the first character must be a letter so that the scanner doesn’t have to make a decision on whether a sequence of characters is an identifier, number or something else. The rest of the characters should be letters or numbers and probably not any symbols outside of an underscore.

It begs the question, “Why?”

Consider a number versus an identifier. Is 0xFF a number or an identifier? We know, by looking at it, it’s most likely to be intended to be an hexadecimal representation of a number but how does our scanner figure that out? How does it know 0xFF is a number but OxFF isn't?

Since a number can contain letters and identifiers can contain numbers it stands to reason that we need SOME way to differential between the two. The typical answer is to specify that a sequence of numbers and letters that start with a letter is an identifier and a similar sequence starting with a number may be a number representation of some kind.

Why no symbols? Well, let’s take a look at another example: a+b. Are we adding two variables, “a” and “b”, or is it an identifier “a+b”? As people, we can usually use context to discover the meaning of the expression but it’s much harder for a compiler. Not only that, finding the correct context might be very time consuming, even for a human.

We might, for example, think that if neither a nor b were declared as variables then it must be an identifier. Can we safely assume that or is it reasonable to believe the programmer made a mistake and meant to declare each variable but forgot to?

Since including symbols in an identifier name is outweighed by the complexity of parsing the result, it’s simpler to not include them.

Strong, Static Typing


Before talking about a variable declaration we need to take a quick detour into type systems.

If you don’t know much, or anything, about them then I encourage you to do some research. It may not be perfect but Wikipedia is a good start. You can find place to start here.

In Calc's case, I chose strong, static typing. Once a variable is a certain type then it is that type forever. If “a” is a string then it can only ever be assigned strings. It can’t be redeclared either within the same scope.

Static typing means that every variable must have it’s type set when it is declared. This can be done either through an explicit type name or through inference.

A type is said to be inferred when the type is taken from an assigned value and not via an explicit type name.

Variable Declaration


Should the type name come before or after the variable name? In C, the type comes before. In Go, it comes after. Why does it matter? Well, lets consider parsing the two following sets of code:

(int a)
(a int)

When it comes to parsing, neither really has a distinct advantage though I’d give a slight edge to option one for clarify of intent.

Let look at another set of examples. this time with the keyword ‘var’:

(var int a)
(var a int)

In this form, I actually like option two better because it looks nicer. Yes, that matters.

Look at the first option. After the var keyword the int looks out of place. Is it a keyword or a variable? Parsing-wise, it doesn't matter but this time it matters to the people who have to read it: us. To me, at least, option two is clearer. Declare a variable ‘a’ and make it of type ‘int’.

Assignment


In C based languages assignment looks like this: identifier = value. This seems sensible enough but there are other options, too:

(= a 5)
(set a 5)

As far as parsing goes, both are identical. The equal sign has the advantage of adding another operator and not introducing any new keywords. Unfortunately, it also might create ambiguity with equality, a common issue in many languages. Set does make our intentions more clear but at the cost of another keyword to parse.

Quite frankly, the cost is minimal but it’s not unreasonable to want to keep either the symbols in your language or the keywords to an absolute minimum. Languages like Scheme and Go do a good job. On the other end of the spectrum is C++, which has gotten out of hand.

I found this thread amusing. 357 keywords? That’s a level of insanity I’d never care to know.

Declaration with Assignment


It just keeps getting worse, doesn't it? First, assume that the language uses the var keyword to declare a variable and the equals sign for assignment. Here are a few different forms it could take:

(= (var a int) 5)
(var a int 5)
(var a 5) - using type inference
(var a 5 int)
(var a int = 5)
(var (= a 5))

I'm sure you could come up with more choices. So, which is the best?

That’s a tough question to answer. In my case, I decided to go with the last option and that has to do with declaring multiple variables and doing multiple assignment. Keep reading for an explanation.

Multi-Assignment


Calc 2 isn't going to have multi-assignment but it’s something to think about and you might want to add it to your own language. How might that look in Calc?

(=  a b, 1 2)
(= (a b) (1 2))
(= ( a 1) (b 2))

Just by looking at this you can see the added complexity. All three options are quite hideous and certainly gives me pause about how to add it properly.

You need to think about how a feature like this is going to fit with the rest of the language. Go’s handling of multiple variables fits like a glove in most cases. Swapping two variables is as easy as:

a, b = b, a

Looks, and feels, good. Consequently, multi-assignment from a function is equally as elegant:

a, b, c = returnsThreeResults()

How good is that? Damn fine. You can also chose to disregard certain variables from function calls with a simple underscore.

It’s clear that the Go developers thought carefully about how handle multiple assignment properly and I think they did a darned good job.

Looking to Go, I decided to use something similar.

(= a b, 1 2); assign1 to a, 2 to b
(= a b, b a); swap the values of a and b

So, when declaring multiple variables, or declaring with assignment:

(var a b int)
(var (= a b, 1 2))
(var (= a b, (returnTwoValues)))

Pointers


Calc 2 isn't going to have pointers but there’s a good chance that future versions will. I think I would make an effort to hide much of the complexity of pointers away from the user but they have their uses.

Pointers and dynamic memory allocation go hand in hand even though they’re mutually exclusive. Pointers can add a lot of complexity and the garbage collector that future versions of Calc are going to use will need to deal with that complexity.

You need to keep this in mind.

Look no further than C to see how bad pointer abuse can become. In Go, the developers decided that pointers were useful enough to include in the language but took efforts to prevent abuse. Case in point, Go doesn't have pointer arithmetic nor does Go allow casting of pointers to incompatible pointer types.

Undeclared Variables


Scoping and the symbol table allow us to detect when a variable is undeclared. Undeclared variables will result in an error in Calc 2.

You’re free to handle this however you want too. Perhaps you would prefer to create a new variable each time an undeclared variable is discovered. Bare in mind, however, that this means that if a programmer makes a typo a new variable will be created instead of warning them through a helpful error that the variable doesn't exist.

Something to think about.

Resolution


I hope I’ve demonstrated how something seemingly simple quickly becomes complex. These choices have far reaching consequences. We have to introduce new operators and keywords. We have to introduce new complexity for both ourselves and our users.

This was a fairly long post and I only covered something as simple as variable declarations!

Later on we’ll have to think about some special cases. For example, if a declaration is a valid expression then should it be allowed in variable assignment? Can you do something like (= a (var b int))? If not, how do we stop that?

I want to impress upon you the weight you bare. Seemingly simple decisions, like using LISP-like expressions, have huge impact on later decisions. Can I, for example, relegate these expressions to mathematical expressions only and use a C-like syntax for the rest of the language? Do I have to conform to encapsulating everything in parentheses?

The questions, and doubts, never end! I’ll just let you mull that over for a bit.

* Update - Fixed some typos *