At a very high level, there are two major types of virtual machines: stack-based and register-based. A stack-based virtual machine requires that we push any data we want to manipulate or analyze onto a stack the virtual machine makes available before doing anything with it. A register-based virtual machine, on the other hand, lets you store the result of a calculation directly into a register (hence the name) without going through an operand stack. I personally find stack-based VMs a bit easier to understand, and they're arguably more widespread; both the .NET CLR and the JVM are implemented as stack-based VMs, though as a notable exception Lua's VM is a register-based implementation. In any case, Schego's VM is a stack-based one.
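To make the difference a bit more concrete, here's a rough sketch of the same addition done both ways. None of this is Schego code; the function names and the four-register file are made up purely for illustration.

```go
package main

import "fmt"

// stackAdd models a stack machine: operands are pushed first, and the
// "add" step pops both of them and pushes the result.
func stackAdd(a, b int64) int64 {
	stack := []int64{}
	stack = append(stack, a) // push a
	stack = append(stack, b) // push b

	// add: pop two operands, push their sum
	y := stack[len(stack)-1]
	x := stack[len(stack)-2]
	stack = stack[:len(stack)-2]
	stack = append(stack, x+y)

	return stack[len(stack)-1]
}

// registerAdd models a register machine: the "add" step names its source
// and destination registers directly, so no operand stack is involved.
func registerAdd(a, b int64) int64 {
	var regs [4]int64 // hypothetical register file
	regs[0] = a                 // load r0, a
	regs[1] = b                 // load r1, b
	regs[2] = regs[0] + regs[1] // add r2, r0, r1
	return regs[2]
}

func main() {
	fmt.Println(stackAdd(2, 3), registerAdd(2, 3))
}
```

Both functions compute the same sum; the only thing that changes is where the operands live while the instruction runs.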
With some terminology out of the way, how do we go about implementing a VM? They're actually fairly straightforward things - we grab the next instruction to execute and simply run a different function depending on the one we grab. Slightly over-simplified, we create a giant switch-case structure and switch on the instruction. The beginning of the VMState.Step() function in Schego, which executes a single opcode, actually starts like this:
    func (v *VMState) Step() {
        if v.CanStep() == false {
            // TODO: properly handle finished VM
            return
        }
        currentOpcode := v.NextOpcode()
        switch currentOpcode {
        // [opcode implementations here]
You could also handle this as a hashtable if you wanted, with the keys being the instructions and the values being function pointers to their respective implementations, but Lua uses a switch-case structure for this part of its VM, and for me, that's a sign that I probably don't stand to gain enough from switching.
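For the curious, the hashtable approach might look roughly like this. It's only a sketch: the vm type, the handler signature, and the two opcode numbers are hypothetical and have nothing to do with Schego's real opcode table.

```go
package main

import "fmt"

// vm is a hypothetical, stripped-down machine state used only in this sketch.
type vm struct {
	stack []int64
}

// handler is the signature every opcode implementation shares.
type handler func(m *vm)

// dispatch maps made-up opcode bytes to their implementations.
var dispatch = map[byte]handler{
	0x01: func(m *vm) { // push the constant 1
		m.stack = append(m.stack, 1)
	},
	0x02: func(m *vm) { // add the top two values (assumes at least two exist)
		n := len(m.stack)
		m.stack = append(m.stack[:n-2], m.stack[n-2]+m.stack[n-1])
	},
}

// run looks each opcode up in the table instead of switching on it.
func run(m *vm, program []byte) error {
	for _, op := range program {
		h, ok := dispatch[op]
		if !ok {
			return fmt.Errorf("unknown opcode 0x%02X", op)
		}
		h(m)
	}
	return nil
}

func main() {
	m := &vm{}
	err := run(m, []byte{0x01, 0x01, 0x02}) // 1 + 1
	fmt.Println(m.stack, err)
}
```

The table makes it easy to register opcodes from several files, at the cost of a map lookup and an indirect call per instruction, which is part of why a plain switch is a perfectly reasonable default.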
What does an instruction look like, though? In Schego, as in most other VMs, an instruction (or opcode) is just a single byte that signifies what the following bytes in memory should be taken as. For an opcode to push an integer onto the stack, for instance, we'd expect the following bytes in memory to be the integer to push, like we have here:
    case 0x03: // pushi
        // simply grab the next 8 bytes and push them
        intBytes := v.ReadBytes(8)
        v.Stack.PushInt(intBytes)
Not that hard to follow, right? Just grab the next 8 bytes in the program and treat them as a 64-bit integer to be pushed onto our stack for the rest of our program to manipulate. After that, we know the next byte should be treated as another opcode, and we can run our switch-case statement again. We can interact with the stack with a pop too, as in the case of comparing two integers:
    case 0x40: // cmpi
        y := v.Stack.PopInt()
        x := v.Stack.PopInt()
        if x == y {
            v.Stack.PushByte(0)
        } else if x > y {
            v.Stack.PushByte(1)
        } else {
            v.Stack.PushByte(2)
        }
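Both pushi and cmpi lean on stack helpers whose bodies aren't shown here. As a rough idea of what such helpers could look like, here's a minimal sketch. To be clear, intStack is a hypothetical stand-in rather than Schego's actual stack, the error returns are my own addition, and the little-endian byte order is an assumption.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// intStack is a hypothetical stand-in for a VM stack, not Schego's real code.
type intStack struct {
	items []int64
}

// PushInt decodes 8 bytes as a little-endian int64 (an assumed byte order)
// and pushes the result.
func (s *intStack) PushInt(raw []byte) error {
	if len(raw) != 8 {
		return fmt.Errorf("expected 8 bytes, got %d", len(raw))
	}
	s.items = append(s.items, int64(binary.LittleEndian.Uint64(raw)))
	return nil
}

// PopInt removes and returns the most recently pushed value.
func (s *intStack) PopInt() (int64, error) {
	if len(s.items) == 0 {
		return 0, errors.New("stack underflow")
	}
	top := s.items[len(s.items)-1]
	s.items = s.items[:len(s.items)-1]
	return top, nil
}

func main() {
	var s intStack
	s.PushInt([]byte{0x2A, 0, 0, 0, 0, 0, 0, 0})                      // 42
	s.PushInt([]byte{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}) // -1
	y, _ := s.PopInt() // popped first, so it was pushed last
	x, _ := s.PopInt()
	fmt.Println(x, y)
}
```

Note the pop order: the value pushed last comes off first, which is why cmpi pops y before x.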
And here's an instruction we might use after comparing two integers - jne (jump if not equal):
    func (v *VMState) jump() {
        addressBytes := v.ReadBytes(8)
        var address int64
        binary.Read(bytes.NewBuffer(addressBytes), binary.LittleEndian, &address)
        v.opcodeBuffer.Seek(address, io.SeekCurrent)
    }

    //----------------------

    case 0x2D: // jne
        cmpResult := v.Stack.PopByte()
        if cmpResult != 0 {
            v.jump()
        } else {
            // skip the jump address
            v.opcodeBuffer.Seek(8, io.SeekCurrent)
        }
See how we're simply reimplementing the same logic our physical CPU might be performing for us? That's all a virtual machine is: no black magic or anything else crazy. At least for now; things will get slightly hairy later on down the line. But the basic switch-case logic detailed above is something all VMs share.
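To tie the three opcodes together, here's a toy VM that is small enough to run on its own. It borrows the opcode numbers from the snippets above, but everything else is a simplification of my own: the run, readInt64, and demo names are hypothetical, a single int64 slice stands in for Schego's stack, and operands are assumed to be little-endian, matching what jump() does.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Opcode values follow the article; every other name here is hypothetical.
const (
	opPushi byte = 0x03
	opJne   byte = 0x2D
	opCmpi  byte = 0x40
)

// readInt64 reads the 8-byte little-endian operand that follows an opcode.
func readInt64(r io.Reader) (int64, error) {
	var n int64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return 0, fmt.Errorf("truncated operand: %w", err)
	}
	return n, nil
}

// run executes program until it runs out of bytes and returns the final stack.
func run(program []byte) ([]int64, error) {
	var stack []int64
	code := bytes.NewReader(program) // its read position is the program counter
	for {
		op, err := code.ReadByte()
		if err == io.EOF {
			return stack, nil
		}
		switch op {
		case opPushi:
			n, err := readInt64(code)
			if err != nil {
				return nil, err
			}
			stack = append(stack, n)
		case opCmpi:
			if len(stack) < 2 {
				return nil, fmt.Errorf("cmpi: stack underflow")
			}
			x, y := stack[len(stack)-2], stack[len(stack)-1]
			stack = stack[:len(stack)-2]
			switch {
			case x == y:
				stack = append(stack, 0)
			case x > y:
				stack = append(stack, 1)
			default:
				stack = append(stack, 2)
			}
		case opJne:
			offset, err := readInt64(code)
			if err != nil || len(stack) == 0 {
				return nil, fmt.Errorf("jne: missing operand or empty stack")
			}
			result := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			if result != 0 { // not equal: move the program counter
				if _, err := code.Seek(offset, io.SeekCurrent); err != nil {
					return nil, err
				}
			}
		default:
			return nil, fmt.Errorf("unknown opcode 0x%02X", op)
		}
	}
}

// demo assembles: pushi a; pushi b; cmpi; jne +9; pushi 100; pushi 200.
func demo(a, b int64) []byte {
	var p []byte
	emit := func(op byte, n int64) {
		var operand [8]byte
		binary.LittleEndian.PutUint64(operand[:], uint64(n))
		p = append(append(p, op), operand[:]...)
	}
	emit(opPushi, a)
	emit(opPushi, b)
	p = append(p, opCmpi)
	emit(opJne, 9) // when a != b, skip the 9-byte "pushi 100" that follows
	emit(opPushi, 100)
	emit(opPushi, 200)
	return p
}

func main() {
	fmt.Println(run(demo(7, 7))) // operands equal: no jump
	fmt.Println(run(demo(7, 8))) // operands differ: jump over pushi 100
}
```

With equal operands the final stack is [100 200]; with different operands the jump skips the first push and leaves just [200].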
Now that we've seen how individual opcodes are implemented, what might a full-blown program in our little language look like? Here's the very first test I wrote for Schego's VM - a small program that just pushes an ASCII string of "Hello, World!" onto the stack before printing it and exiting with status code 0:
    type DummyConsole struct {
        consoleOutput string
    }

    func (d *DummyConsole) Write(line string) {
        // trim null
        d.consoleOutput = strings.TrimRight(line, "\x00")
    }

    func TestHelloWorld(t *testing.T) {
        opcodes := []byte{
            0x05, // pushs
            0x48, // H
            0x65, // e
            0x6C, // l
            0x6C, // l
            0x6F, // o
            0x2C, // ,
            0x20, // space
            0x57, // W
            0x6F, // o
            0x72, // r
            0x6C, // l
            0x64, // d
            0x21, // !
            0x0A, // \n
            0x00, // null
            0x43, // syscall
            0x05, // print string
            0x03, // pushi
            0x00,
            0x00,
            0x00,
            0x00,
            0x00,
            0x00,
            0x00,
            0x00, // 0
            0x43, // syscall
            0x06, // exit
        }
        console := DummyConsole{}
        retcode := RunVM(opcodes, &console)
        if retcode != 0 {
            t.Error("Expected return code of 0, got:\n", retcode)
        }
        if console.consoleOutput != "Hello, World!\n" {
            t.Error("Incorrect output, got: ", console.consoleOutput)
        }
    }
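Typing out bytecode by hand gets old quickly, so a small helper could assemble the same layout from a Go string. This is just a sketch: assemblePrint and the constant names are hypothetical (Schego doesn't ship this helper as far as this article is concerned), and the byte values are copied straight from the test above.

```go
package main

import "fmt"

// Byte values are taken from the hand-written test; the names are made up.
const (
	opPushi        byte = 0x03
	opPushs        byte = 0x05
	opSyscall      byte = 0x43
	sysPrintString byte = 0x05
	sysExit        byte = 0x06
)

// assemblePrint builds a program that pushes msg as a null-terminated
// string, prints it, then exits with status 0.
func assemblePrint(msg string) []byte {
	var p []byte
	p = append(p, opPushs)
	p = append(p, msg...) // raw bytes of the string
	p = append(p, 0x00)   // null terminator
	p = append(p, opSyscall, sysPrintString)
	p = append(p, opPushi)
	p = append(p, make([]byte, 8)...) // 64-bit zero: the exit status
	p = append(p, opSyscall, sysExit)
	return p
}

func main() {
	// Dump the assembled program as space-separated hex bytes.
	fmt.Printf("% X\n", assemblePrint("Hello, World!\n"))
}
```

Calling assemblePrint("Hello, World!\n") should produce the same 29 bytes as the hand-written opcodes slice in the test.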
The opcodes in the test are fairly well annotated, so it should be pretty obvious what's going on. I've skipped some uninteresting stuff (like the specific implementation of Schego's stack) for the sake of keeping this article brief and interesting, and also some potentially interesting stuff (how do we push a string onto the stack?), since I intend to cover that in a later installment, especially considering how technically dense the next article will be.
But you guys will have to wait just a little bit longer for that one. As always, though, the code is up on GitHub if you want to take a further look.