In short, I didn't feel like I needed to know.
That is, until I started learning about compilers and assembly. Suddenly, how data was represented in memory mattered and I began having to ask a lot of questions.
Questions like: What does a struct or array look like in memory? How are their elements accessed? Is extra space needed for the struct itself?
There are even more questions to ask about word boundaries, cache hits and paging but for the sake of brevity, we’ll keep it to just those posed.
Because Science!
OK, not really.
C is pretty close to the hardware and I probably understand the x86 assembly for it best so I chose to write an example in that language and peeked at the assembly output.
struct { int a, b; } s = {1, 2};
A quick sizeof check revealed that the above struct was 8 bytes. Each integer takes up 4 bytes on my system, so this is a reasonable result.
Next, I inspected the assembly generated by gcc. The values 1 and 2 were moved onto the stack, side by side. We would call this segment of memory contiguous.
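If you'd rather check the layout without reading assembly, here is a minimal sketch. It assumes the struct is given a name (pair, my own choice) so it can be inspected, and the helper function names are hypothetical too. On mainstream compilers I'd expect b to sit exactly one int after a, but that is an assumption about the platform, not a guarantee from the C standard.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Same shape as the struct above, given a name so we can inspect it. */
struct pair { int a, b; };

/* Offset of field b from the start of the struct, in bytes. */
size_t pair_offset_of_b(void) {
    return offsetof(struct pair, b);
}

/* Returns 1 if b sits immediately after a in memory, 0 otherwise. */
int pair_is_contiguous(const struct pair *p) {
    return (const char *)&p->b == (const char *)&p->a + sizeof(int);
}
```

Calling pair_is_contiguous on a struct initialised with {1, 2} should report that the two fields are side by side.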
Finally, I wrote a third test which passed the struct as an argument to a function. Again, only the two values, represented by ‘a’ and ‘b’, were copied onto the stack as arguments to the function.
Other than how they were arranged in memory, the fields of a struct are treated like any other variable.
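To make the pass-by-value behaviour concrete, here is a small sketch. The names pair and sum_after_bump are hypothetical. The function receives its own copy of the two fields, so changing them inside the function leaves the caller's struct alone.

```c
struct pair { int a, b; };

/* Receives a copy of the caller's struct; changes made here stay local. */
int sum_after_bump(struct pair p) {
    p.a += 10;          /* modifies the copy only */
    return p.a + p.b;
}
```

Passing a struct holding {1, 2} returns 13, while the caller's field a is still 1 afterwards.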
Conclusions, Struct Test
Based on my tests, I was able to draw the following conclusions:
- The space needed for a C struct is the sum of the sizes of its fields, plus any padding the compiler adds for alignment;
- A C struct is stored in a contiguous block of memory for easy access.
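The sum-of-fields rule held exactly for my two-int struct, but it is worth seeing a case where it may not. The sketch below uses a hypothetical struct (mixed) with a char followed by an int. Compilers usually insert padding after the char so the int is aligned, typically 3 bytes on x86-64, though the exact amount depends on the platform.

```c
#include <stddef.h>  /* size_t */

/* A char followed by an int: the int usually needs to be aligned. */
struct mixed { char c; int i; };

/* Bytes the compiler added beyond the raw sizes of the fields. */
size_t mixed_padding(void) {
    return sizeof(struct mixed) - (sizeof(char) + sizeof(int));
}
```

The struct is still one contiguous block; it is just a little larger than its fields alone would suggest.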
In the end, a struct is only meaningful to the compiler and programmer. Each field is merely a memory address and nothing else about the struct is known to the CPU.
What about methods?
Function Pointers
Actually, before we can discuss object methods, we need to talk about functions. And to talk about functions, we need to cover pointers.
A pointer is just like any other variable except that the value it holds is a memory address.
The size of a pointer is the size of a memory address in the target system. If the platform is 64 bits, a pointer is 8 bytes. If it’s 32 bits, the pointer is 4 bytes and so on.
void func(void) {}
void (*f)(void) = func;
Taking sizeof f on my 64-bit system confirms that a pointer is 8 bytes.
Note: Taking sizeof of the function name itself, in this case func, gives a size of 1 byte with gcc. To understand why, check out this response on Stack Overflow.
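Here is a slightly fuller sketch of a function pointer in use. The names add_one, apply and func_pointer_size are hypothetical. The point is that the pointer holds nothing but an address, and calling through it is just a call to whatever lives at that address.

```c
#include <stddef.h>  /* size_t */

/* A hypothetical function to point at. */
int add_one(int x) {
    return x + 1;
}

/* Calls whatever function f points to; f holds only an address. */
int apply(int (*f)(int), int x) {
    return f(x);
}

/* Size of a pointer to this kind of function on the target platform. */
size_t func_pointer_size(void) {
    return sizeof(int (*)(int));
}
```

Passing add_one and 41 to apply returns 42. I'd expect func_pointer_size to report 8 on a typical 64-bit system, but that depends on the platform.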
Methods
A method might seem a lot more complex than a function pointer but I’ll quickly demonstrate why it’s not.
In Go, you see something like this:
type myStruct struct {}
func (s myStruct) Method() {}
So why is it not so complex? Well, it’s because another way to express this same method is:
func Method(s myStruct) {}
Once you recognise that a method call is merely syntactic sugar, dispatching the call becomes much simpler. In both cases, move the data in argument s to the stack or a register and call the function as normal using the first parameter as the struct data.
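C has no methods, which makes it a handy way to see what the sugar-free version looks like. This is only a sketch of what a compiler for a language with methods might generate, with hypothetical names (rect, rect_area). The receiver simply becomes the first parameter.

```c
struct rect { int w, h; };

/* What a call like r.Area() might be lowered to: the receiver is
   passed by value as an ordinary first argument. */
int rect_area(struct rect r) {
    return r.w * r.h;
}
```

A rect holding {3, 4} passed to rect_area gives 12, with no extra machinery behind the call.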
Arrays
Armed with this knowledge, we can deduce how an array works pretty easily. An array is merely a segment of contiguous memory.
int arr[5] = {0, 1, 2, 3, 4};
The assembly for this C array moves each value onto the stack one at a time and side by side. All the data for the array is kept packed together, but other than that there’s nothing special going on.
All the CPU needs to know is where the first element of the array can be found. We can access the rest by using a byte offset from the first element.
Note: If you ever wondered why an array in C behaves like a pointer to its first element (like I did) then there’s your answer.
To access a specific index, x86 assembly uses SIB (scale-index-base) addressing: base + (scale * index).
Base is the address of the start of the array, in this case the pointer arr. The scale is the size of each element: for a standard word-sized element on a 32-bit machine, that is 4 bytes. From the address of arr, add 4 bytes multiplied by i to get the ith element.
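The base + (scale * index) calculation can be spelled out by hand in C. This is a sketch with a hypothetical helper (element_at) that does the byte arithmetic explicitly instead of letting the compiler do it. Real code would just write base[index].

```c
#include <stddef.h>  /* size_t */
#include <string.h>  /* memcpy */

/* Reads element `index` by computing base + scale * index in bytes. */
int element_at(const int *base, size_t index) {
    const unsigned char *addr =
        (const unsigned char *)base + sizeof(int) * index;
    int value;
    memcpy(&value, addr, sizeof value);  /* copy the bytes at that address */
    return value;
}
```

For the array {0, 1, 2, 3, 4}, element_at with index 3 returns 3, the same value as the ordinary index expression.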
Note: In C, arr[i] is equivalent to *(arr + i) for this very reason. In both cases, you are accessing the ith element relative to arr’s address.
Note: A C array does not contain any information about the array itself, like its length. This is why C strings use a NUL character ('\0') to mark their end. It’s the only way that C functions can detect when they have reached the end of a character array. It is also one reason C does not have bounds checking.
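A short sketch shows both consequences of an array not knowing its own length. The function names (my_strlen, sum_ints) are hypothetical. The first walks a string until it hits the terminating '\0'. The second works on a plain int array, so the caller has to hand over the length separately.

```c
#include <stddef.h>  /* size_t */

/* Counts characters up to, but not including, the terminating '\0'. */
size_t my_strlen(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* For other arrays there is no terminator, so the length is passed in. */
int sum_ints(const int *arr, size_t len) {
    int total = 0;
    for (size_t i = 0; i < len; i++) {
        total += arr[i];
    }
    return total;
}
```

Neither function can tell if it has been given a bad length or an unterminated string, which is exactly the missing bounds checking mentioned above.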
Implementing
I hope my explanations will help you implement arrays, structures (or classes), function pointers and methods in your own language. Many of these things seem very daunting, but they’re not actually that hard.
The hard part is wrapping your head around it!
In the end, the generated code is pretty simple. The complexity is, generally speaking, going to be in your compiler.
If you’re interested in what some Go types look like, check out this blog post by Russ Cox.