Monday, April 30, 2018

Schego Part 4, Or The Heap, Linked Lists, and Other Scary Subsystems

This is the second half (see the first half here) of the giant blog update on Schego I wrote. I concentrated so hard on actually coding that I forgot to update the blog as I went! Hopefully I'll be able to split my time more evenly between documentation and coding going forward, but on with the show: we're going to talk about a couple of new additions to Schego that most interpreted programming languages have to implement in one way or another.

First up, and the subsystem that everything else in this blog post depends on in one way or another, is the new heap allocator, based on the buddy memory allocation technique. It's a fairly venerable algorithm (I believe the Linux kernel uses a modified version for its low-level page allocator, which kmalloc ultimately sits on top of) and also pretty easy to implement, with Schego's implementation spanning a hair under 200 lines as of the time of writing.

I'm gonna skimp on justifying the need for a heap allocator since I feel it's kinda obvious (how many non-embedded programs have you written that just used the stack?), but I do think it would be a good idea to examine why I'd use the buddy allocator in particular. It's certainly not the only allocator on the block, but it's definitely the most popular one; when looking around for preexisting allocators to study, the buddy allocator was far and away the most covered. I think the advantage of having one or more reference implementations is difficult to overstate; when you're unsure whether you're "doing it right," having something beyond a white paper is pretty handy. You could argue that if you're about to write a memory allocator you should be able to implement it based solely on a white paper, but the whole point of this series is to teach, right? So let's do some teaching, especially since good prose explanations of memory allocation algorithms are pretty thin on the ground at this point.

The buddy system at its core is pretty intuitive: the basic idea is to split the heap up into stretches of memory called blocks that we can then place our variables into. Since we want to handle differently-sized objects - say, a single 8-bit value versus a 32-byte array of 8-bit values - we can have differently-sized blocks. And because referring to blocks by how many bits or bytes they take up is kind of unwieldy, let's use the idea of an order system to specify how large our blocks are. We start with some base size that we'll define as order 0 (an order 0 block in Schego is currently 32 bytes, for instance), and then each order beyond that is twice as big as the one preceding it; in other words, order n has a length of (order 0 length) * 2 ^ n. In Schego, the order calculation looks something like this right now:

func (h *VMHeap) OrderFor(requestedBytes uint64) uint8 {
 if requestedBytes <= blockSize {
  return 0
 }
 // TODO: handle requests past maxOrder gracefully
 // this is ugly, but kinda fast, and less stupid than the old solution
 // note the rounding up, so non-power-of-two requests still get a big enough block
 order := uint8(math.Ceil(math.Log2(float64(requestedBytes)) - math.Log2(float64(blockSize))))
 return order
}

By the way, the math in the OrderFor function is just a simplification of the order calculation I mentioned earlier so we can solve for n directly - don't worry that it looks a bit different.
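
To make the formula concrete, here's a tiny standalone sketch of the same math (my own illustration, not Schego code - it duplicates the formula just so it can run on its own), using the 32-byte order 0 block size mentioned above:

package main

import (
 "fmt"
 "math"
)

// orderFor mirrors the calculation above: find the smallest order n
// such that 32 * 2^n bytes can hold the request
func orderFor(requestedBytes uint64) uint8 {
 const blockSize = 32 // Schego's current order 0 block size
 if requestedBytes <= blockSize {
  return 0
 }
 // round up so non-power-of-two requests still get a big enough block
 return uint8(math.Ceil(math.Log2(float64(requestedBytes)) - math.Log2(float64(blockSize))))
}

func main() {
 fmt.Println(orderFor(20))  // 0 - fits in a single 32-byte block
 fmt.Println(orderFor(100)) // 2 - needs a 128-byte block, since 64 bytes is too small
 fmt.Println(orderFor(256)) // 3 - exactly one 256-byte block
}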

Anyway, now we have a way to refer to chunks of memory, but how do we actually allocate objects? First, we use OrderFor to figure out the smallest order of block capable of holding the requested number of bytes. If there's already a free block of that order in the heap, we're done - we simply return the address of that block. But what if only bigger blocks are available? This will have to happen at least once, too, as the heap starts its life as one giant block.

In that case, we have to subdivide one of the bigger blocks down the middle, and simply repeat the process on one of the halves until we have a block of the order the caller asked for. It requires a bit of bookkeeping in practice, but on paper it's fairly straightforward. As a nice bonus, we've created a bunch of other free blocks in the process that should fit the bill if the caller requests similarly-sized allocations in the future, saving us a bit of work. (If only smaller blocks are available, we've run out of memory and need to tell the caller so; here we only deal with the case where larger blocks are available.) Schego implements the splitting process like so:

func (h *VMHeap) CreateBlock(order uint8) {
 // find smallest order that we can pull from
 freeOrder := order + 1
 for {
  if h.NoFreeBlocksFor(freeOrder) {
   freeOrder += 1
  } else {
   break
  }
 }
 // repeatedly split blocks until we get one (technically, two) of the order we originally wanted
 for freeOrder > order {
  blockAddress := h.GetFreeBlock(freeOrder)
  h.SplitBlock(blockAddress, freeOrder)
  freeOrder -= 1
 }
}
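
CreateBlock leans on SplitBlock for the actual bookkeeping, which I haven't shown here. The gist is that splitting a block of order n produces two buddies of order n - 1, the second one starting halfway into the original. Here's a minimal sketch of what that might look like, assuming the blockMap and unusedBlocks structures you'll see in the merging code below - Schego's real SplitBlock may differ in the details:

// hypothetical sketch of the splitting bookkeeping, not a verbatim excerpt
func (h *VMHeap) SplitBlock(address uint64, order uint8) {
 // the original block stops existing at its old order...
 blockIndex := h.GetUnusedBlockIndex(address, order)
 h.RemoveBlockFromUnused(blockIndex, order)
 // ...and becomes two buddies of the next order down, the second
 // one starting one order n-1 block length into the original
 buddyAddress := address + uint64(blockSize)<<(order-1)
 h.blockMap[address] = order - 1
 h.blockMap[buddyAddress] = order - 1
 h.unusedBlocks[order-1] = append(h.unusedBlocks[order-1], address, buddyAddress)
}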

So that's allocation handled, but how do we deallocate objects? If you followed the block subdivision part of allocation, you might have noticed that blocks of the same size get created in pairs. Sometimes one of those blocks then gets subdivided further, but this idea of pairs - or buddies - is where the buddy allocation system gets its name. After freeing up a single block (which is as easy as putting its address back on the list of free blocks), we check whether its buddy exists and is free, too. If so, we merge the two blocks back together into a single larger block, and repeat the same buddy check/merge process until no free buddy can be found. This keeps the heap from fragmenting and wasting space, and generally keeps the number of blocks only barely higher than necessary to cover all the objects in the heap (remember, each time we request a block, we're potentially creating two blocks of that size). In Schego, the merge process is done like this:

func (h *VMHeap) MergeWithBuddy(address uint64, order uint8) {
 buddyAddress := h.GetBuddyAddress(address, order)
 // figure out which address is lower and delete the other block
 // take the lower address for the new merged block
 var newAddress uint64
 if buddyAddress < address {
  newAddress = buddyAddress
  delete(h.blockMap, address)
 } else {
  newAddress = address
  delete(h.blockMap, buddyAddress)
 }
 buddyIndex := h.GetUnusedBlockIndex(buddyAddress, order)
 h.RemoveBlockFromUnused(buddyIndex, order)
 blockIndex := h.GetUnusedBlockIndex(address, order)
 h.RemoveBlockFromUnused(blockIndex, order)
 h.blockMap[newAddress] = order + 1
 h.unusedBlocks[order+1] = append(h.unusedBlocks[order+1], newAddress)
 // recurse if we still have potential merging left undone
 if h.HasBuddy(newAddress, order+1) {
  h.MergeWithBuddy(newAddress, order+1)
 }
}
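
I haven't shown GetBuddyAddress or HasBuddy either, but the buddy lookup is worth knowing about: because buddies are always the two aligned halves of the same parent block, you can get from one to the other by flipping a single address bit. Here's a hedged sketch, assuming block addresses are byte offsets from the start of the heap (if they were absolute addresses you'd subtract the heap's base first):

// illustrative sketch - the real GetBuddyAddress may differ
func (h *VMHeap) GetBuddyAddress(address uint64, order uint8) uint64 {
 // a block of order n spans blockSize * 2^n bytes; its buddy sits exactly
 // that far away, so XORing by the block length flips between the two
 return address ^ (uint64(blockSize) << order)
}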

After all that, our top-level Allocate() and Free() functions are as simple as this:

func (h *VMHeap) Allocate(numBytes uint64) uint64 {
 order := h.OrderFor(numBytes)
 if h.NoFreeBlocksFor(order) {
  h.CreateBlock(order)
 }
 blockAddress := h.GetFreeBlock(order)
 // GetFreeBlock always returns the first/0th free block,
 // so remove that one
 h.RemoveBlockFromUnused(0, order)
 return blockAddress
}

func (h *VMHeap) Free(address uint64) {
 order := h.blockMap[address]
 // add the newly freed block back to the list of unused blocks
 // MergeWithBuddy will take care of removing it if need be due to merging
 h.unusedBlocks[order] = append(h.unusedBlocks[order], address)
 if h.HasBuddy(address, order) {
  h.MergeWithBuddy(address, order)
 }
}
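
To put it all together, typical usage from the rest of the VM looks roughly like this - note that NewVMHeap is a stand-in name for whatever actually constructs the heap; I'm only illustrating the call pattern:

heap := NewVMHeap()              // hypothetical constructor name
strAddress := heap.Allocate(100) // smallest block that fits 100 bytes: order 2, i.e. 128 bytes
intAddress := heap.Allocate(8)   // fits in an order 0 (32-byte) block
// ... read and write through the heap's accessors ...
heap.Free(intAddress) // merges back with its buddy if that buddy is also free
heap.Free(strAddress)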

The only real downside that I'm aware of with the buddy allocator is that there's a bit of fine-tuning involved with block sizes; if blocks are too big, there will be a lot of wasted space. Going back to Schego, every time an 8-byte integer is allocated, 24 bytes of heap memory are currently wasted. On the other hand, smaller block sizes mean more merging upon deallocation, which is a performance hit. I haven't looked yet to see how heap allocators in other virtual machines handle this, but I plan on doing so shortly.

That was a lot to take in, I know, but once we have our heap built and tested, all we'll have to interact with are its Allocate() and Free() functions - the rest is under the hood. With all that said, let's examine the next piece of the puzzle: strings.

Strings aren't a difficult concept. They're just a bunch of bytes in memory all lined up in a row that get treated as letters and stuff - most of you reading this could probably have implemented basic string functionality in Schego pretty easily, given a heap allocator. The R7RS Scheme spec throws a mild curveball at us, however, in that proper string support also means supporting UTF-8.

Pretty much all of you are familiar with the ASCII character encoding that older languages like C support out of the box - one byte per character, all that good stuff. Most of you have probably heard of UTF-8 (and definitely used it - you're reading UTF-8-encoded text right now), but maybe never had to think about writing explicit support for it in your applications, since many programming languages today, like Python 3, offer built-in UTF-8 support in their strings.

The UTF-8 specification itself is fairly simple, but for those of you coming from ASCII, you now have to deal with the fact that you're no longer guaranteed every character is some fixed number of bytes; all 128 ASCII characters keep the exact same one-byte length and value in UTF-8, but everything outside the ASCII range takes up more than one byte. In technical terms, UTF-8 is what's known as a variable-width character encoding.
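
For reference, this is everything the encoding asks of us when decoding: the high bits of a character's first byte tell you how many bytes that character occupies. Here's a quick standalone sketch of that rule (this is just the UTF-8 spec written out, not Schego code):

package main

import "fmt"

// utf8Length reports how many bytes a UTF-8 character occupies,
// judging only by its first byte
func utf8Length(firstByte byte) int {
 switch {
 case firstByte&0x80 == 0x00: // 0xxxxxxx - plain ASCII
  return 1
 case firstByte&0xE0 == 0xC0: // 110xxxxx
  return 2
 case firstByte&0xF0 == 0xE0: // 1110xxxx
  return 3
 case firstByte&0xF8 == 0xF0: // 11110xxx
  return 4
 default: // 10xxxxxx - a continuation byte, never valid as a first byte
  return -1
 }
}

func main() {
 fmt.Println(utf8Length('A'))  // 1
 fmt.Println(utf8Length(0xC3)) // 2 - first byte of é
 fmt.Println(utf8Length(0xE3)) // 3 - first byte of most Japanese kana
 fmt.Println(utf8Length(0xF0)) // 4 - first byte of most emoji
}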

This requires only a little bit of planning on our end, though, since UTF-8 makes it pretty easy to figure out how large each character is just by looking at its first byte. If the topmost bit is 0, we've got a 1-byte ASCII character (or the null character) on our hands. Otherwise, we check the upper four bits: 0xC or 0xD means our character will be 2 bytes long, and 0xE means 3 bytes. Currently Schego assumes any other value means a 4-byte character, which will be the case in properly-formed UTF-8 text, but could pose a problem if we ever want to make sure no invalid characters slip through the cracks. Anyway, here's how Schego currently reads UTF-8 text off the stack:

  utfBytes := make([]byte, 0)
  for {
   firstByte := v.ReadBytes(1)[0]
   if firstByte&0x80 == 0 || firstByte == 0 {
    // ASCII or null byte
    utfBytes = append(utfBytes, firstByte)
    // null?
    if firstByte == 0 {
     break
    } else {
     continue
    }
   }
   // with UTF-8, the most significant bits tell us
   // how many bytes to read, so construct some simple bitmasks
   // to check
   // see: https://en.wikipedia.org/wiki/UTF-8#Description
   // essentially, check the upper four bits of the first byte:
   // 1111 means read 4 bytes total, 1110 means 3 bytes, and
   // 1100 or 1101 (0xC or 0xD) means 2 bytes
   var codepointLength int
   upperFourBits := firstByte >> 4
   if upperFourBits == 0xC || upperFourBits == 0xD {
    codepointLength = 2
   } else if upperFourBits == 0xE {
    codepointLength = 3
   } else {
    // naively assume 4-byte codepoint length,
    // could be dangerous, should probably be replaced
    // at some point in the future
    codepointLength = 4
   }
   // we've already read one byte (firstByte)
   numBytes := codepointLength - 1
   extraBytes := v.ReadBytes(numBytes)
   utfRune := append([]byte{firstByte}, extraBytes...)
   // append the finished character
   utfBytes = append(utfBytes, utfRune...)
  }
  v.Stack.PushString(utfBytes)

That's actually the entire current implementation of the opcode to push a string onto the stack. Not that bad, huh? Putting strings on the heap isn't too much extra work. All we really have to do is ensure there's enough space to store the string in the heap, store the length along with it, and maybe reallocate space if what we currently allocated isn't enough:

  mnemonic := string(v.ReadBytes(2))
  address := v.mnemonicMap[mnemonic]
  strBytes := v.Stack.PopString()
  var strBuffer bytes.Buffer
  numBytes, _ := strBuffer.Write(strBytes)
  numBytes64 := uint64(numBytes)
  var allocatedBytes uint64
  binary.Read(v.Heap.Read(8, address), binary.LittleEndian, &allocatedBytes)
  if numBytes64 > allocatedBytes {
   v.Heap.Free(address)
   newAddress := v.Heap.Allocate(8 + numBytes64)
   v.mnemonicMap[mnemonic] = newAddress
   intBuffer := bytes.NewBuffer(make([]byte, 0))
   binary.Write(intBuffer, binary.LittleEndian, &numBytes64)
   v.Heap.Write(intBuffer, newAddress)
   // offset by 8 to avoid writing over the int we just wrote
   v.Heap.Write(&strBuffer, newAddress+8)
  } else {
   v.Heap.Write(&strBuffer, address+8)
  }

This is also a good time to introduce Schego's reference mnemonics into the picture, which are just 2 arbitrary bytes that can refer to some point in memory. You can think of them like variable names or pointers - the heap allocator decides what specific address in memory they point to, but you can always refer to that address with the name.
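
In code terms, a mnemonic is just a two-byte string used as a key into a plain Go map from mnemonic to heap address; something along these lines happens inside the hnew* opcodes (an illustration in the spirit of the snippets above, not a verbatim excerpt):

// illustration only - how an hnew*-style opcode might bind a mnemonic
mnemonic := string(v.ReadBytes(2)) // e.g. the 0xBE 0xEF pair from the test below
address := v.Heap.Allocate(24)     // enough room for a list cell's three 8-byte integers
v.mnemonicMap[mnemonic] = address  // later opcodes can now refer to this block by name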

That about does it for strings. There are some test cases showing off the UTF-8 support, but you can go look at the code yourself if you're curious - they don't exactly tell you anything you don't already know at this point. So let's keep moving on to linked lists.

Schego, like a good little Scheme implementation, provides bytecode-level support for linked lists. Under the hood, a single linked list node in Schego is pretty straightforward, being composed of just 3 unsigned integers: the first stores the number of bytes allocated for the data held in the node, the next stores the address of the data itself, and the last stores the address of the next node in the list (I'll sketch this layout as a struct after the car implementation below). This basically means the linked list-related opcodes just pop 3 integers at a time under the hood, so as long as our underlying code is solid (implementing one of the opcodes revealed a bug in the addi instruction that caused an extra byte to be prepended to every sum - ouch) and we always push/pop in the correct order, the resulting implementation is pretty boring. For instance, this is the implementation of what's known as car (Contents of Address Register - long story), or in other words, the function that retrieves the data stored at a linked list node and puts it onto the stack:


  cell := v.Stack.PopCell()
  numBytes := cell[0]
  dataAddress := cell[1]
  cellData := v.Heap.Read(numBytes, dataAddress).Bytes()
  for i := uint64(0); i < numBytes; i++ {
   v.Stack.PushByte(cellData[numBytes-i-1])
  }
  // hack to get VMStack.lenLastPushed to play nicely with hcar
  // should probably be replaced with a dedicated method on
  // VMStack in the future (PushCellData?)
  v.Stack.lenLastPushed = numBytes
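
To make the cell layout from a moment ago concrete, here's how those three integers would look written out as a struct (purely illustrative - the VM actually handles them as three raw uint64 values on the stack rather than through a named type):

// illustration only - the cell layout described above
type listCell struct {
 numBytes    uint64 // how many bytes were allocated for this node's data
 dataAddress uint64 // heap address of the data itself
 nextAddress uint64 // heap address of the next cell in the list
}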

The car implementation isn't that involved, right? Let's close out the article with the test case that shows how all the list opcodes are meant to be used together:

func TestListSum(t *testing.T) {
 // create a list containing the values 1, 2, and 3,
 // and sum them up
 // mnemonics:
 // 0xBEEF - 1st list cell/list head
 // 0xDEAD - 2nd list cell
 // 0xFACE - 3rd list cell
 // 0xACED - sum counter
 opcodes := []byte{
  0x25, // hnewl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x06, // cons
  0x03, // pushi
  0x01,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 1
  0x4B, // hscar
  0x0D, // hstorel
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x25, // hnewl
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x06, // cons
  0x03, // pushi
  0x02,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 2
  0x4B, // hscar
  0x0D, // hstorel
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x19, // hloadl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x4D, // hscdr
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x0D, // hstorel
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x25, //  hnewl
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x06, // cons
  0x03, // pushi
  0x03,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 3
  0x4B, // hscar
  0x0D, // hstorel
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x19, // hloadl
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x4D, // hscdr
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x0D, // hstorel
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x22, // hnewi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  // zero out our allocated memory, as we are not
  // currently guaranteed any new memory will be zeroed
  0x03, // pushi
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 0
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x19, // hloadl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  // use dup/hcdr to leave the list cells on the stack
  // in reverse sequential order (3-2-1 from top to bottom, and not 1-2-3)
  0x07, // dup
  0x49, // hcdr
  0x07, // dup
  0x49, // hcdr
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x43, // syscall
  0x03, // print integer
  0x03, // pushi
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 0
  0x43, // syscall
  0x06, // exit
 }
 console := DummyConsole{}
 RunVM(opcodes, &console)
 if console.consoleOutput != "6" {
  t.Error("Incorrect output, got: ", console.consoleOutput)
 }
}

Pretty lengthy, I know, but only because of the tedium involved in writing out a linked list at the bytecode level. If you're already familiar with Scheme/LISP, or just functional programming in general, the overall flow of the program should make a fair amount of sense. But that about does it for this installment. As usual, the code is on GitHub if you want to take a deeper look.
