Writing a VM
I promised a while back to learn compilers in earnest – the programs that turn your high-level code into the low-level bits that your computer can execute.
I wrote up a few tree-walk interpreters – they would tokenize the programming language provided, parse it, and then execute the AST itself, executing the code.
Tree-walk interpreters work pretty well for the most part – they’re easier to write, and have great introspection (each node has all the info it needs). But it didn’t really scratch my itch of going all the way to the metal.
If I write an assembly file to return the number 10, I would write:
.globl mainmain:
mov $10, %rax
ret
I can compile that output:
$ gcc test.s -o a.out
And objdump -d
the generated file:
$ objdump -d a.out
0000000000401106 <main>:
401106: 48 c7 c0 0a 00 00 00 mov $0xa,%rax
40110d: c3 ret
So our instructions are encoded as
48 c7 c0 0a 00 00 00 c3
as bytes.
mov $10 %rax
corresponds to
48 c7 c0 0a 00 00 00
, and ret
turns into
c3
.
So, our “high-level” assembly that we wrote is turned into those bytes, and the computer can read those bytes and execute them.
A tree-walk interpreter stops before serializing and deserializing instructions – it runs the instructions from in-memory, so there’s no need to dump the state to disk.
To be fair, interpreters have a way to dump the tree that they’re executing to disk, like running this command:
$ clang -Xclang -ast-dump=json -fsyntax-only -Wno-visibility test.c
# emitted JSON here.
So we’d like to do that, by turning our instruction stream from those high level instructions into bytes, and then being able to read those bytes back and execute the program from binary.
The way I did was sketching out a VM with a set amount of instructions. Each instruction would be the first byte (so I had a cap of 256 instructions), and then the following bytes would be the arguments.
Next, I had to handle the arity of each instruction. If you have say
mov $10, %rax
and mov %rbx, %rax
, these have
to be encoded as two distinct instructions – one is an immediate to
register move, and the other is a register to register move, even though
they’re both called mov
in the written assembly.
So, I ended up with some 25 instructions, and each instruction would slurp up the bytes it wanted from a binary, or serialize to those bytes, and then voila, as long as you had a VM that could take those bytes and return them to the assembly instructions, you could write any program you wanted.
An example program, which would print 10 and then exit would look like this in assembly:
10 R0
putreg
printreg R0ret
I could then encode that file as binary:
$ cargo r -q -- -e print.asm > out.bin
And look at its contents:
$ xxd out.bin
00000000: 010a 0000 0900 00 .......
And run it from the binary format.
cargo r -q -- -d out.bin
10
So the program went all the way from the assembly itself, to bytes, and then was run properly by the VM.
If you’d like to take a look at the VM code on github I’ve linked it here.
Next step – writing a high level language that can emit the bytecode, so I can go all the way from a high-level language to the bare metal.