Wednesday, March 6, 2019

Our Augustine - a serial

For about a year now, I've been intending to publish a serial I've been writing, which I've dubbed "Our Augustine." I think the prologue - which is what I'm releasing in conjunction with this blog post - should mostly explain itself, but to make a (mostly-unpublished!) not-so-long story short, I set out to answer this question: How would a society that truly does not believe in the concept of free will structure itself? How would it come to make decisions it considered ethical? I've tried to wrap up what I believe to be the answer in a reasonably-entertaining sci-fi war story, but you can be the judge of that yourself by reading the prologue here.

I'm licensing the text of the serial under the CC BY-SA 4.0 license; in the future, I'll try to embed the text of successive releases in the blog itself.

Saturday, November 17, 2018

Homogenization of Contexts, or The Right to a Hot Take

A PDF copy of this essay can be found here.

This is something that's a little bit out of the ordinary for my blog - something more philosophical than technical - but it's something that's been weighing on my mind for some time now. I've discussed the ideas I present here with many of my friends, and they understood (though not all necessarily agreed!), so maybe the internet at large would be interested as well. I think that the growing homogenization of the internet through sites like Facebook and YouTube, as well as the growing use of SSO (Single Sign-On) solutions that let a single account log in to a variety of smaller sites, has had a chilling effect on the honesty of internet-based discourse; more importantly, it threatens to prevent our complete selves from being expressed to the world, which could prove extraordinarily harmful in the long run.

Monday, April 30, 2018

Schego Part 4, Or The Heap, Linked Lists, and Other Scary Subsystems

This is the second half (see the first half here) of the giant blog update on Schego I wrote. I concentrated so hard on actually coding, I forgot to update the blog as I went! Hopefully I'll be able to split my time more evenly between documentation and coding in the future, but on with the show: we're going to talk about a couple of new additions to Schego that most interpreted programming languages will have to implement in one way or another.

First up - and the subsystem that everything else in this blog post depends on in one way or another - is the new heap allocator, based on the buddy memory allocation technique. It's fairly venerable (I believe the Linux kernel uses a modified version for kmalloc) and also pretty easy to implement, with Schego's implementation spanning a hair under 200 lines as of this writing.

I'm gonna skimp on justifying the need for a heap allocator since I feel it's kinda obvious (how many non-embedded programs have you written that just used the stack?), but I do think it would be a good idea to examine why I'd use the buddy allocator in particular. It's certainly not the only allocator on the block, but it's definitely the most popular one; when looking around for preexisting allocators to study, the buddy allocator was far and away the most covered. The advantage of having one or more reference implementations is hard to overstate; when you're unsure if you're "doing it right," having something beyond a white paper is pretty handy. You could argue that if you're about to write a memory allocator you should be able to implement it based solely on a white paper, but the whole point of this series is to teach, right? So let's do some teaching, especially since prose explanations of memory allocation algorithms are basically non-existent at this point.

The buddy system at its core is pretty intuitive: The basic idea involves simply splitting up the heap into these stretches of memory called blocks that we then can place our variables into. Since we want to handle differently-sized objects, say, a single 8-bit value versus a 32-bit array of 8-bit values, we can have differently-sized blocks. Also, because referring to blocks by how many bits or bytes they take up is kind of unwieldy, let's use the idea of an order system to specify how large our blocks are. So at first, we have some base size that we'll define as order 0 (an order 0 block in Schego is currently 32 bytes, for instance), and then each order beyond that is twice as big as the one preceding it, or in other words, order n has a length of (order 0 length) * 2 ^ n. In Schego, the order calculation looks something like this right now:

func (h *VMHeap) OrderFor(requestedBytes uint64) uint8 {
 if requestedBytes <= blockSize {
  return 0
 }
 // TODO: handle requests past maxOrder gracefully
 // this is ugly, but kinda fast, and less stupid than the old solution
 // note the Ceil - a request that overshoots one block size has to
 // round up to the next order, not down
 var order uint8
 order = uint8(math.Ceil(math.Log2(float64(requestedBytes)) - math.Log2(float64(blockSize))))
 return order
}

By the way, the math in the OrderFor function is just the order calculation I mentioned earlier rearranged so we can solve for n directly, rounding up so that a request that overshoots a block size bumps to the next order - don't worry that it looks a bit different. For example, with 32-byte order 0 blocks, a 100-byte request works out to ceil(log2(100) - log2(32)) = ceil(~1.64) = 2, i.e. a 128-byte order 2 block.

Anyway, now we have a way to refer to chunks of memory, but how do we actually allocate objects? First, we figure out the smallest order of block capable of holding the requested number of bytes with OrderFor. If there's already a free block in the heap of that order, we're done - we simply return the address of that block. But what if only bigger blocks are available? This will have to happen at least once, too, as the heap starts its life as one giant block.

In that case, we have to subdivide one of the bigger blocks down the middle, and simply repeat the process on one of the halves until we have a block of the size we want. It requires a bit of bookkeeping in practice, but on paper it's fairly straightforward. As a nice bonus, we've created a bunch of other free blocks in the process that should fit the bill if the caller requests allocation of similarly-sized objects in the future, saving us a bit of work. (If only smaller blocks are available, we've run out of memory and need to tell the caller so; here we only deal with the case where larger blocks are available.) Schego implements the splitting process like so:

func (h *VMHeap) CreateBlock(order uint8) {
 // find smallest order that we can pull from
 freeOrder := order + 1
 for {
  if h.NoFreeBlocksFor(freeOrder) {
   freeOrder += 1
  } else {
   break
  }
 }
 // repeatedly split blocks until we get one (technically, two) of the order we originally wanted
 for freeOrder > order {
  blockAddress := h.GetFreeBlock(freeOrder)
  h.SplitBlock(blockAddress, freeOrder)
  freeOrder -= 1
 }
}

So that's allocation handled, but how do we deallocate objects? If you followed the block subdivision part of allocation, you might notice blocks of the same size get created in pairs. Sometimes one of those blocks then gets subdivided further, but this idea of pairs - or buddies - is where the buddy allocation system gets its name. After freeing up a single block (which is as easy as putting its address back on the list of free blocks), we check if its buddy exists and is free, too. If so, we merge the two blocks back together into a single larger block, and do the same buddy check/merge process until no buddy can be found. This keeps the heap from fragmenting and wasting space, and generally makes it such that there are only barely more blocks (each time we request a block, we're potentially creating 2 blocks with that size, remember?) than necessary to cover all objects in the heap. In Schego, the merge process is done like this:

func (h *VMHeap) MergeWithBuddy(address uint64, order uint8) {
 buddyAddress := h.GetBuddyAddress(address, order)
 // figure out which address is lower and delete the other block
 // take the lower address for the new merged block
 var newAddress uint64
 if buddyAddress < address {
  newAddress = buddyAddress
  delete(h.blockMap, address)
 } else {
  newAddress = address
  delete(h.blockMap, buddyAddress)
 }
 buddyIndex := h.GetUnusedBlockIndex(buddyAddress, order)
 h.RemoveBlockFromUnused(buddyIndex, order)
 blockIndex := h.GetUnusedBlockIndex(address, order)
 h.RemoveBlockFromUnused(blockIndex, order)
 h.blockMap[newAddress] = order + 1
 h.unusedBlocks[order+1] = append(h.unusedBlocks[order+1], newAddress)
 // recurse if we still have potential merging left undone
 if h.HasBuddy(newAddress, order+1) {
  h.MergeWithBuddy(newAddress, order+1)
 }
}

After all that, our top-level Allocate() and Free() functions are as simple as this:

func (h *VMHeap) Allocate(numBytes uint64) uint64 {
 order := h.OrderFor(numBytes)
 if h.NoFreeBlocksFor(order) {
  h.CreateBlock(order)
 }
 blockAddress := h.GetFreeBlock(order)
 // GetFreeBlock always returns the first/0th free block,
 // so remove that one
 h.RemoveBlockFromUnused(0, order)
 return blockAddress
}

func (h *VMHeap) Free(address uint64) {
 order := h.blockMap[address]
 // add the newly freed block back to the list of unused blocks
 // MergeWithBuddy will take care of removing it if need be due to merging
 h.unusedBlocks[order] = append(h.unusedBlocks[order], address)
 if h.HasBuddy(address, order) {
  h.MergeWithBuddy(address, order)
 }
}

The only real downside that I'm aware of with the buddy allocator is that there's a bit of fine-tuning involved with block sizes; if blocks are too big, there will be a lot of wasted space. Going back to Schego, every time an 8-byte integer is allocated, 24 bytes of heap memory are currently wasted. On the other hand, smaller block sizes mean more merging upon deallocation, which costs time. I haven't looked yet to see how heap allocators in other virtual machines handle this, but I plan on doing so shortly.

That was a lot to take in, I know, but once we have our heap built and tested, all we'll ever need to touch are its Allocate() and Free() functions - the rest is all under the hood. A quick usage sketch (NewVMHeap is an assumed constructor name here, not necessarily Schego's real one):
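
heap := NewVMHeap() // assumed constructor, for illustration only
address := heap.Allocate(100) // rounds up to one 128-byte (order 2) block
// ... write to and read from the heap at address ...
heap.Free(address) // hands the block back, merging buddies where possible

So all that said, let's examine the next piece of the puzzle - strings.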

Strings aren't a difficult concept. They're just a bunch of bytes in memory all lined up in a row that get treated as letters and such - most of you reading this could've probably implemented basic string functionality in Schego pretty easily, given a heap allocator. The R7RS Scheme spec throws a mild curveball at us, however, in that proper string support also involves supporting UTF-8.

Pretty much all of you are familiar with the ASCII character encoding that older languages like C support out of the box - one byte per character, all that good stuff. Most of you have probably heard of UTF-8 (and definitely used it - you're reading UTF-8-encoded text right now), but maybe not had to ever think about writing explicit support for it in your applications - many programming languages today like Python 3 offer built-in support for UTF-8 in their strings.

Going over the UTF-8 specification, it's fairly simple, but for those of you coming from ASCII, you now have to deal with the fact that you're no longer guaranteed that every character is some fixed number of bytes; every ASCII character has the exact same length and value in UTF-8, but anything outside the ASCII range takes up more than one byte. In technical terms, UTF-8 is what is known as a variable-width character encoding.
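
Go itself makes the variable-width nature easy to see, since len() on a string counts bytes, not characters:

package main

import "fmt"

func main() {
 fmt.Println(len("e"))  // 1 byte - plain ASCII
 fmt.Println(len("é"))  // 2 bytes
 fmt.Println(len("日")) // 3 bytes
 fmt.Println(len("🎉")) // 4 bytes
}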

This requires only a little bit of planning on our end, though, since UTF-8 makes it pretty easy to figure out how large each character is from its first byte. If the high bit is zero, we've got a 1-byte ASCII character (or the null character) on our hands. If the top four bits are 0xC or 0xD, our character will be 2 bytes long, and for 0xE, 3 bytes. Currently Schego assumes that any other value means the character will be 4 bytes long, which is the case in properly-formed UTF-8 text, but could pose a problem for ensuring no invalid characters slip through the cracks. Anyway, here's how Schego currently reads UTF-8 text off the stack:

utfBytes := make([]byte, 0)
for {
 firstByte := v.ReadBytes(1)[0]
 if firstByte&0x80 == 0 {
  // ASCII or null byte
  utfBytes = append(utfBytes, firstByte)
  // null terminator?
  if firstByte == 0 {
   break
  }
  continue
 }
 // with UTF-8, the most significant bits of the first byte tell us
 // how many bytes to read, so construct some simple bitmasks to check
 // see: https://en.wikipedia.org/wiki/UTF-8#Description
 // essentially, check the upper four bits of the first byte:
 // 1111 means read 4 bytes total, 1110 means 3 bytes, and
 // 1100/1101 both mean 2 bytes
 var codepointLength int
 upperFourBits := firstByte >> 4
 if upperFourBits == 0xC || upperFourBits == 0xD {
  codepointLength = 2
 } else if upperFourBits == 0xE {
  codepointLength = 3
 } else {
  // naively assume a 4-byte codepoint,
  // could be dangerous, should probably be replaced
  // with real validation at some point in the future
  codepointLength = 4
 }
 // we've already read one byte (firstByte)
 numBytes := codepointLength - 1
 extraBytes := v.ReadBytes(numBytes)
 utfRune := append([]byte{firstByte}, extraBytes...)
 // append the finished character
 utfBytes = append(utfBytes, utfRune...)
}
v.Stack.PushString(utfBytes)

That's actually the entire current implementation of the opcode to push a string onto the stack. Not that bad, huh? Putting strings on the heap isn't too much extra work, either. All we really have to do is ensure there's enough space in the heap to store the string, store the length along with it, and maybe reallocate if what we previously allocated isn't enough:

mnemonic := string(v.ReadBytes(2))
address := v.mnemonicMap[mnemonic]
strBytes := v.Stack.PopString()
var strBuffer bytes.Buffer
numBytes, _ := strBuffer.Write(strBytes)
numBytes64 := uint64(numBytes)
var allocatedBytes uint64
binary.Read(v.Heap.Read(8, address), binary.LittleEndian, &allocatedBytes)
if numBytes64 > allocatedBytes {
 v.Heap.Free(address)
 newAddress := v.Heap.Allocate(8 + numBytes64)
 v.mnemonicMap[mnemonic] = newAddress
 intBuffer := bytes.NewBuffer(make([]byte, 0))
 binary.Write(intBuffer, binary.LittleEndian, &numBytes64)
 v.Heap.Write(intBuffer, newAddress)
 // offset by 8 to avoid writing over the length int we just wrote
 v.Heap.Write(&strBuffer, newAddress+8)
} else {
 v.Heap.Write(&strBuffer, address+8)
}

This is also a good time to introduce Schego's reference mnemonics, which are just 2 arbitrary bytes that refer to some point in memory. You can think of them like variable names or pointers - the heap allocator decides what specific address in memory they point to, but you can always refer to that address by the name.
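
In code terms, resolving a mnemonic is nothing more than a map lookup - the first couple of lines of the string-storing snippet above already show it, but isolated here for clarity (same fields as above):

mnemonic := string(v.ReadBytes(2)) // e.g. the bytes 0xBE 0xEF
address := v.mnemonicMap[mnemonic] // wherever the allocator put that object
data := v.Heap.Read(8, address)    // read through the reference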

That about does it for strings. There are some test cases showing off the UTF-8 support, but you can go look at the code yourself if you're curious - they don't exactly tell you anything you don't already know at this point. So let's keep moving, to linked lists.

Schego, like a good little Scheme implementation, provides bytecode-level support for linked lists. Under the hood, a single linked list node in Schego is pretty straightforward, being composed of just 3 unsigned integers: the first stores the number of bytes allocated for the data held in the node, the second stores the address of the data itself, and the last stores the address of the next node in the list. This basically means we just pop 3 integers at a time under the hood inside the linked list-related opcodes, so as long as our underlying code is solid (implementing one of the opcodes revealed a bug in the addi instruction that caused an extra byte to be prepended to every sum - ouch) and we always push/pop in the correct order, the resulting implementation is pretty boring. Written out as a struct, a node would look something like the sketch below (the names are illustrative only - the VM just pushes and pops three raw integers).
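
// illustrative only - Schego pushes/pops these as three raw uint64s
// rather than a named struct
type listCell struct {
 numBytes    uint64 // number of bytes allocated for this node's data
 dataAddress uint64 // heap address of the data itself
 nextAddress uint64 // heap address of the next node in the list
}

For instance, this is the implementation for what's known as car (Contents of Address Register - long story), or in other words, the function to retrieve the data stored at a linked list node and put it onto the stack: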


cell := v.Stack.PopCell()
numBytes := cell[0]
dataAddress := cell[1]
cellData := v.Heap.Read(numBytes, dataAddress).Bytes()
for i := uint64(0); i < numBytes; i++ {
 v.Stack.PushByte(cellData[numBytes-i-1])
}
// hack to get VMStack.lenLastPushed to play nicely with hcar
// should probably be replaced with a dedicated method on
// VMStack in the future (PushCellData?)
v.Stack.lenLastPushed = numBytes

Not that involved, right? Let's close out the article with the test case that shows how all the list opcodes are meant to be used together:

func TestListSum(t *testing.T) {
 // create a list containing the values 1, 2, and 3,
 // and sum them up
 // mnemonics:
 // 0xBEEF - 1st list cell/list head
 // 0xDEAD - 2nd list cell
 // 0xFACE - 3rd list cell
 // 0xACED - sum counter
 opcodes := []byte{
  0x25, // hnewl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x06, // cons
  0x03, // pushi
  0x01,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 1
  0x4B, // hscar
  0x0D, // hstorel
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x25, // hnewl
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x06, // cons
  0x03, // pushi
  0x02,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 2
  0x4B, // hscar
  0x0D, // hstorel
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x19, // hloadl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x4D, // hscdr
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x0D, // hstorel
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  0x25, //  hnewl
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x06, // cons
  0x03, // pushi
  0x03,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 3
  0x4B, // hscar
  0x0D, // hstorel
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x19, // hloadl
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x4D, // hscdr
  0xFA,
  0xCE, // reference mnemonic - 0xFACE
  0x0D, // hstorel
  0xDE,
  0xAD, // reference mnemonic - 0xDEAD
  0x22, // hnewi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  // zero out our allocated memory, as we are not
  // currently guaranteed any new memory will be zeroed
  0x03, // pushi
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 0
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x19, // hloadl
  0xBE,
  0xEF, // reference mnemonic - 0xBEEF
  // use dup/hcdr to leave the list cells on the stack
  // in reverse sequential order (3-2-1 from top to bottom, and not 1-2-3)
  0x07, // dup
  0x49, // hcdr
  0x07, // dup
  0x49, // hcdr
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x0A, // hstorei
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x47, // hcar
  0x16, // hloadi
  0xAC,
  0xED, // reference mnemonic - 0xACED
  0x36, // addi
  0x43, // syscall
  0x03, // print integer
  0x03, // pushi
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 0
  0x43, // syscall
  0x06, // exit
 }
 console := DummyConsole{}
 RunVM(opcodes, &console)
 if console.consoleOutput != "6" {
  t.Error("Incorrect output, got: ", console.consoleOutput)
 }
}

Pretty lengthy, I know, but only because of the tedium involved in writing out a linked list at the bytecode level. If you're already familiar with Scheme/LISP, or just functional programming in general, the general flow of the program should make a fair amount of sense. But that about does it for this installment. As usual, the code is on GitHub if you want to take a deeper look.

Wednesday, April 25, 2018

Schego Part 3, Or Virtual Machines Aren't Really That Hard

Since the previous installment in the Schego saga, a lot's happened. So much, in fact, that I've split up the next update into 2 separate parts - the one you're reading now, and one that will follow very shortly. Life has a way of getting in the way; you get so absorbed with "important" stuff you forget the blog series you promised to update on a reasonably-sane basis. On the upside, those of you paying attention will notice that I've still been updating the actual codebase itself with some pretty exciting features, and I'll do my best to go over some of them with this entry, in which we'll be looking at stack-based virtual machines. I'm thinking most of you will be familiar with the concept of a virtual machine, but perhaps haven't coded one before, so I'm not going to delve into what a virtual machine is. You might not have heard the term "stack-based virtual machine," so I'll take a slight detour to cover that first.

Essentially, there are - at a very high level - two major types of virtual machines: stack-based and register-based. A stack-based virtual machine requires that we yank any data we want to manipulate or analyze onto a stack the virtual machine makes available before doing anything with it. A register-based virtual machine, on the other hand, will allow you to store the results of any calculations directly into a register (hence the name) without any stack usage. I personally find stack-based VMs a bit easier to understand, and they're arguably more widespread; both the .NET CLR and the JVM are implemented as stack-based VMs, though as a notable exception, Lua's VM is a register-based implementation. In any case, Schego's VM is a stack-based one.
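
To make the distinction concrete, here's a toy sketch - plain Go, nothing Schego-specific - of how 2 + 3 gets computed in each style:

package main

import "fmt"

func main() {
 // stack-based: operands are implicit, flowing through the stack
 //   push 2; push 3; add   (add pops two values, pushes their sum)
 // register-based: operands are named right in the instruction
 //   add r2, r0, r1        (r2 = r0 + r1, no stack involved)
 var stack []int64
 push := func(v int64) { stack = append(stack, v) }
 pop := func() int64 {
  v := stack[len(stack)-1]
  stack = stack[:len(stack)-1]
  return v
 }
 push(2)
 push(3)
 push(pop() + pop()) // what an "add" opcode does under the hood
 fmt.Println(pop())  // 5
}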

With some terminology out of the way, how do we go about implementing a VM? They're actually fairly straightforward things - we grab the next instruction to execute and simply run a different function depending on the one we grab. Slightly over-simplified, we create a giant switch-case structure and switch on the instruction. The beginning of the VMState.Step() function in Schego, which executes a single opcode, actually starts like this:

func (v *VMState) Step() {
 if v.CanStep() == false {
  // TODO: properly handle finished VM
  return
 }
 currentOpcode := v.NextOpcode()
 switch currentOpcode {
 // [opcode implementations here]

You could also handle this with a hashtable if you wanted, with the keys being the opcodes and the values being function pointers to their respective implementations, but Lua uses a switch-case structure for this part of its VM, and for me, that's a sign that I probably don't stand to gain enough from switching.
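
For the curious, here's a rough sketch of what that table-driven version might look like - this is hypothetical, not code that exists in Schego:

// hypothetical alternative dispatch - Schego itself sticks with the switch
type opcodeFunc func(v *VMState)

var opcodeTable = map[byte]opcodeFunc{
 0x03: func(v *VMState) { v.Stack.PushInt(v.ReadBytes(8)) }, // pushi
 // ... one entry per opcode ...
}

// then, inside Step():
// if impl, ok := opcodeTable[currentOpcode]; ok {
//  impl(v)
// }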

What does an instruction look like, though? In Schego, as in most other VMs, an instruction (or opcode) is just a single byte that signifies what the following bytes in memory should be taken as. For an opcode to push an integer onto the stack, for instance, we'd expect the following bytes in memory to be the integer to push, like we have here:

case 0x03:
  // pushi
  // simply grab the next 8 bytes and push them
  intBytes := v.ReadBytes(8)
  v.Stack.PushInt(intBytes)

Not that hard to follow, right? Just grab the next 8 bytes in the program and treat them as a 64-bit integer to be pushed onto our stack for the rest of our program to manipulate. After that, we know the next byte should be treated as another opcode, and we can run our switch-case statement again. We can interact with the stack with a pop too, as in the case of comparing two integers:

case 0x40:
  // cmpi
  y := v.Stack.PopInt()
  x := v.Stack.PopInt()
  if x == y {
   v.Stack.PushByte(0)
  } else if x > y {
   v.Stack.PushByte(1)
  } else {
   v.Stack.PushByte(2)
  }

And here's an instruction we might use after comparing two integers - jne (jump if not equal):

func (v *VMState) jump() {
 addressBytes := v.ReadBytes(8)
 var address int64
 binary.Read(bytes.NewBuffer(addressBytes), binary.LittleEndian, &address)
 v.opcodeBuffer.Seek(address, io.SeekCurrent)
}

//----------------------

case 0x2D:
  // jne
  cmpResult := v.Stack.PopByte()
  if cmpResult != 0 {
   v.jump()
  } else {
   // skip the jump address
   v.opcodeBuffer.Seek(8, io.SeekCurrent)
  }


See how we're simply reimplementing the same logic our physical CPU might be performing for us? That's all a virtual machine is - no black magic or anything else crazy. At least for now; things will get slightly hairy later on down the line. But the basic switch-case logic detailed above is something all VMs share.

Now that we've seen how individual opcodes are implemented, what might a full-blown program in our little language look like? Here's the very first test I wrote for Schego's VM - a small program that just pushes an ASCII string of "Hello, World!" onto the stack before printing it and exiting with status code 0:

type DummyConsole struct {
 consoleOutput string
}

func (d *DummyConsole) Write(line string) {
 // trim null
 d.consoleOutput = strings.TrimRight(line, "\x00")
}

func TestHelloWorld(t *testing.T) {
 opcodes := []byte{
  0x05, // pushs
  0x48, // H
  0x65, // e
  0x6C, // l
  0x6C, // l
  0x6F, // o
  0x2C, // ,
  0x20, // space
  0x57, // W
  0x6F, // o
  0x72, // r
  0x6C, // l
  0x64, // d
  0x21, // !
  0x0A, // \n
  0x00, // null
  0x43, // syscall
  0x05, // print string
  0x03, // pushi
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00,
  0x00, // 0
  0x43, // syscall
  0x06, // exit
 }
 console := DummyConsole{}
 retcode := RunVM(opcodes, &console)
 if retcode != 0 {
  t.Error("Expected return code of 0, got:\n", retcode)
 }
 if console.consoleOutput != "Hello, World!\n" {
  t.Error("Incorrect output, got: ", console.consoleOutput)
 }
}

The opcodes in the test are fairly well annotated, so it should be pretty obvious what's going on. I've skipped some uninteresting stuff (like the specific implementation of Schego's stack) for the sake of keeping this article brief and interesting, and also some potentially interesting stuff (how do we push a string onto the stack?), since I intend to cover that in a later installment, especially considering how technically dense the next article will be.

But you guys will have to wait just a little bit longer for that one. As always, though, the code is up on GitHub if you want to take a further look.

Monday, April 23, 2018

Sophomore year: a (rambling) retrospective

I wish there was a way to know you're in the good old days before you've actually left them.

"Gone in a flash" would be the phrase that first came to mind if you asked me to describe how my sophomore year of college went. Honestly, college in general has seemingly flown by; it doesn't seem like all that long ago I was a high school graduate, thinking of college as the "next big phase" of his life. And yet somehow, I'm on the tail end of that journey already. You've probably heard the old bit about how life seems to move faster as you get older - it certainly seems that way to me, at least.

The scenery is different compared to freshman year, if nothing else. I've traded the empty expanse of the dorm lobby for the living room of my apartment. I've got working internet without having to leave my own abode - though it was only a few hours prior to typing this that I got the wireless drivers working on the Debian installation on my Thinkpad. Unlike last time, I'm not sick, and I'm pretty grateful for that, considering finals week is looming. It's a lot quieter here in the living room than it was in the lobby, except for the unerring tick-tock of the wall clock mounted above the utility room door. The home theater setup my roommates and I collectively built sits against the wall opposite me - a gorgeous 4K TV and subwoofer in addition to a Blu-ray player and my trusty PlayStation 2 and PlayStation 4. I wish someone would've told me how little time for gaming I would have here, though.

The outside world isn't all that's changed. Skipping class has gone from a moral affront of the worst degree to a last resort if I can't keep myself awake during prior lectures. I've noticed myself becoming much more proactive and confident when it comes to getting involved in campus activities. I'm an undergraduate researcher now - not something I ever saw myself becoming during high school or even freshman year. I got a job offer from Cisco to come work with them in Silicon Valley over the summer. I've volunteered to a great degree to get the good word out about STEM, probably reaching as many grade school children as I did my entire time in high school. There's still room for improvement, though; I told myself I would get involved in hockey but have failed to do so yet, and I've been on the fence about a workout routine for some time now. Maybe putting all that into words will finally shame me into action, or so I hope.

Honestly, with the whirlwind that sophomore year has been, I wonder what I was doing at all during my freshman year. On paper, freshman year was pretty quiet for the most part; classes, some volunteering activities here and there, not much else. Ultimately, I think I was simply unaware of how far I could truly push myself.

There's a balance to be found between work and pleasure. Some of my fondest memories from freshman year are of me doing inane things with friends, or lazing around doing nothing in particular. I remember a particular afternoon when a copy of Watchmen caught my eye in the university bookstore, and I spent several subsequent afternoons reading it on the steps in front of the plinth at the center of campus. It's a really pretty spot, still one of my favorites on campus - the pool surrounding the plinth, the row of trees adorning the linear pools further on that you can just make out if you crane your neck from the top of the steps. It wasn't there my freshman year, but immediately behind where I once read now stands the largest indoor Starbucks in the state, so there's that too.

I wouldn't trade any of those memories for the world, but I wouldn't part with any of my newfound achievements, either; both have their place. I think it does one no good to be on top of the world, but have nobody to celebrate it with. Conversely, I would almost consider my life wasted if I spent it in a purely cavalier manner - there is far too much I want to do, to see, to contribute, to burn it all away on pleasure. I don't claim to know how much of both is right for me or anyone else, mind you; I wouldn't be surprised if I found myself dissatisfied with sophomore year come junior year.

Yet, to a certain extent, I welcome that sense of discontent. I take it to mean that I am holding myself to a higher standard than I did before. At the end of the day, how much more can you ask from someone aside from steady improvement? As long as you know where you stand and where you're headed, I don't think you can go wrong.

So here's to the latter half of undergraduate life, with even more joy and achievements to be had. I hope one day I'll be able to look back on sophomore year with dissatisfaction.

Sunday, October 1, 2017

Schego Part 2, Or On Parsing and Other Related Stuff

Alright, part 2. And only a couple weeks later than planned to boot!

Last time, we discussed the basics of lexing - turning a string into a list of tokens - and the ins and outs of how that's done. So now, we'll move onto parsing, which is the next step in the process of turning our original source code into something the computer can understand.

The rationale behind parsing is very simple, and mirrors that of lexing: It's a lot easier to reason about generating code if you're working at the statement/expression level instead of at the token level. Think about it. If you were trying to write a program using a description of what that program is supposed to do, wouldn't you rather the description say "put an if statement here" instead of "stick a semicolon here and a number there"?

Additionally - and this is probably at least two future blog posts away - it's a lot easier to perform optimizations when thinking at the statement level as well. If you have a general idea of what the original code is attempting to do, you can potentially replace it with something faster. We'll need other data structures besides what our parsing process produces, but we'll cross that bridge when we get to it.

So, how is Schego's parser built? At a high level, much the same way as any parser: We represent the program's structure with a tree, known as an AST (Abstract Syntax Tree), with each individual node adhering to this interface:

// base interface for functions needing to accept any kind of AST node
type AstNode interface {
 GetSubNodes() []AstNode
 AddSubNode(AstNode)
 GetType() AstNodeType
 DebugString() string
}

If you recall the first article, I didn't create any interfaces there, so why the change? There will be functions - as we'll see in a moment - that should ideally work with/return any potential combination of AST nodes; when you're parsing a language where literally everything evaluates to some kind of expression, you need flexibility without repeating yourself. Since Go doesn't have partially abstract classes, I had to split the definition and implementation into two separate places, with the definition in AstNode and the baseline implementation in another struct, SExp - short for S-Expression, which is what the structure of LISP/Scheme is based on:

// base struct that all AST node implementations build off of
type SExp struct {
 subNodes []AstNode
}

func (s *SExp) GetSubNodes() []AstNode {
 return s.subNodes
}
func (s *SExp) AddSubNode(node AstNode) {
 s.subNodes = append(s.subNodes, node)
}

Still pretty straightforward, right? Specific nodes implement GetType() and DebugString() to return their proper values, and the basic stuff that should only be written once gets written once. By the way, GetType() returns a value from an enum describing what sort of node the struct is. This isn't normally something we'll need, but for certain optimizations, including the tail-call optimization technique required of all Scheme implementations, it'll be handy to see whether a node has sub-nodes of a certain type.
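
For reference, the enum would look something like this - just the node types this article touches, and every name here other than IfNode and StringNode is a guess on my part rather than Schego's actual source:

type AstNodeType int

const (
 ProgramNode AstNodeType = iota
 AddNode
 GtNode
 IfNode
 IntNode
 FloatNode
 StringNode
 BoolNode
)
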
For the actual implementation of one of the nodes, here's the current implementation of if expressions:

type IfExp struct {
 SExp
}

func NewIfExp(cond AstNode, onTrue AstNode, onFalse AstNode) *IfExp {
 node := new(IfExp)
 node.AddSubNode(cond)
 node.AddSubNode(onTrue)
 node.AddSubNode(onFalse)
 return node
}
func (i IfExp) GetType() AstNodeType {
 return IfNode
}
func (i IfExp) DebugString() string {
 return "IfExp(" + i.subNodes[0].DebugString() + ", " + i.subNodes[1].DebugString() + ", " + i.subNodes[2].DebugString() + ")"
}

Nothing too out of the ordinary here. But what about literals?

type StringLiteral struct {
 SExp
 Value string
}

func NewStringLiteral(value string) *StringLiteral {
 node := new(StringLiteral)
 node.Value = value
 return node
}
func (s StringLiteral) GetType() AstNodeType {
 return StringNode
}
func (s StringLiteral) DebugString() string {
 return "\"" + s.Value + "\""
}

You might think that there's really no need for a string literal - or any other sort of literal, really - to have the ability to contain sub-nodes, an ability brought in by the SExp struct. And you'd be right; this is something of a leaky abstraction we've got going on here. For now, I think it's fine. Any code that will actually be messing with the AST after the fact should be calling GetType() in the first place, and due to the way parsing works in Schego, nothing besides post-AST-construction passes will be modifying the literal, other than the exact same code that created it from the original tokens in the first place. I'll also admit that I don't know for sure if what I describe here is an idiomatic Go solution to the problem, but it seemed fairly intuitive to me. If there's something more elegant out there, do let me know - I'm all in favor of doing no more work than necessary!

On a side note, structuring our data like this lets us write our tests fairly easily. All we have to do is compare the AST our parser made with one we made ourselves in the test, like this:

func TestParseSingleExp(t *testing.T) {
 tokens := LexExp("(+ 5 3)")
 program := ParseTokens(tokens)
 expectedProgram := NewProgram(NewAddExp(NewIntLiteral(5), NewIntLiteral(3)))
 checkProgram(program, expectedProgram, t)
}

But this article was about parsing, right? How do we actually do the parsing bit? We've covered the data structures behind the process, but haven't detailed the actual parsing algorithm itself. Schego's parser is what is known as a "recursive descent" parser; essentially, we define our "grammar" - our set of rules that describes our language - recursively. Take the if expression, for instance. We might say it's defined by the word "if," followed by a conditional expression, followed by two more expressions that will be evaluated if the conditional evaluates to true or false.

What do we define as an expression? It could be a lot of different things - a literal, a math formula, a function, another if expression (!), lots of stuff. What's important is that when we're constructing an if expression, we ask the parser to run its general expression-parsing function for the if's expressions. As long as we handle the base cases (literals/identifiers) properly, we can throw all sorts of wonky expressions at our if-parsing routine and it'll handle them just fine thanks to the recursion.
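
Written out as an informal grammar - my own shorthand, not anything official from R7RS - the slice of Scheme we can parse so far looks something like this:

expression := int-literal | float-literal | string-literal | bool-literal
            | "(" operator expression* ")"
            | "(" "if" expression expression expression ")"
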
Additionally, the resultant code is fairly reusable inside the parser itself. If we define a function to parse, say, a list literal, we can simply call that function any time we expect a list literal, and it handles all the nitty-gritty details of validation and list-literal-specific parsing for us. Pretty much a win-win no matter how you look at it. But enough justification. What does Schego's parsing code look like?

// ParseTokens takes tokens and returns an AST (Abstract Syntax Tree) representation
func ParseTokens(tokens []*Token) *Program {
 program := NewProgram()
 currentIndex := 0
 for len(tokens)-1 >= currentIndex {
  node, _ := parseExpression(tokens, &currentIndex)
  program.AddSubNode(node)
 }
 return program
}

// accept checks to see if the current token matches a given token type, and advances if so
func accept(tokens []*Token, expectedType TokenType, currentIndex *int) bool {
 if tokens[*currentIndex].Type == expectedType {
  *currentIndex++
  return true
 }
 return false
}

// grabAccepted returns the token just before current, useful for grabbing the value of an accepted token
func grabAccepted(tokens []*Token, currentIndex *int) *Token {
 return tokens[*currentIndex-1]
}

// expect returns an error if the current token doesn't match the given type
func expect(tokens []*Token, expectedType TokenType, currentIndex *int) error {
 if len(tokens)-1 < *currentIndex {
  return errors.New("Unexpected EOF")
 } else if tokens[*currentIndex].Type != expectedType {
  return errors.New("Unexpected token")
 }
 return nil
}

func parseExpression(tokens []*Token, currentIndex *int) (AstNode, error) {
 // try literals first
 if accept(tokens, TokenIntLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewIntLiteral(bufferToInt(literal.Value)), nil
 } else if accept(tokens, TokenFloatLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewFloatLiteral(bufferToFloat(literal.Value)), nil
 } else if accept(tokens, TokenStringLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewStringLiteral(literal.Value.String()), nil
 } else if accept(tokens, TokenBoolLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewBoolLiteral(literal.Value.Bytes()[0] == 1), nil
 }
 // not a literal, attempt to parse an expression
 lparenError := expect(tokens, TokenLParen, currentIndex)
 if lparenError != nil {
  return nil, lparenError
 }
 // jump past the lparen
 *currentIndex++
 if accept(tokens, TokenOp, currentIndex) {
  // operator-parsing code here
 }
 if accept(tokens, TokenIdent, currentIndex) {
  identToken := grabAccepted(tokens, currentIndex)
  switch identToken.Value.String() {
  case "if":
   // TODO: error-handling here (and throughout the parser!)
   cond, _ := parseExpression(tokens, currentIndex)
   ifTrue, _ := parseExpression(tokens, currentIndex)
   ifFalse, _ := parseExpression(tokens, currentIndex)
   ifNode := NewIfExp(cond, ifTrue, ifFalse)
   expError := closeExp(tokens, currentIndex)
   if expError != nil {
    return nil, expError
   }
   return ifNode, nil
  }
 }
 // no matches?
 return nil, errors.New("Unexpected token")
}

// convenience function to ensure an expression is properly closed
func closeExp(tokens []*Token, currentIndex *int) error {
 rparenError := expect(tokens, TokenRParen, currentIndex)
 if rparenError != nil {
  return rparenError
 }
 *currentIndex += 1
 return nil
}

func bufferToInt(buffer bytes.Buffer) int64 {
 num, _ := binary.Varint(buffer.Bytes())
 return num
}
func bufferToFloat(buffer bytes.Buffer) float64 {
 bits := binary.LittleEndian.Uint64(buffer.Bytes())
 return math.Float64frombits(bits)
}

I've gutted some case-specific stuff like the details of parsing operators in an attempt to get the point across as succinctly as possible, but the overarching algorithm is still there. You can see the recursive calls to parseExpression present in the if-parsing code, and the literal-parsing code that forms the base cases. This basic setup is enough to get this tricky-looking test to pass:

func TestIfExp(t *testing.T) {
 tokens := LexExp("(if (> 6 5) \"true\" \"false\")")
 program := ParseTokens(tokens)
 expectedProgram := NewProgram(NewIfExp(NewGtExp(NewIntLiteral(6), NewIntLiteral(5)), NewStringLiteral("true"), NewStringLiteral("false")))
 checkProgram(program, expectedProgram, t)
}

All the if-parsing code knows is that an if node takes three expressions. It doesn't know anything about what constitutes a valid expression, or how to parse one; it leaves that up to the parseExpression() function, and it all Just Works.

As always, the code is up on GitHub if you want to take a look. We'll be looking at stack-based virtual machines next time, so we can actually do something with the AST we've generated. As I alluded to earlier, at some point in the (near? distant?) future we'll also take a look at optimization, which is something I haven't really seen covered in any similar how-to-make-a-language series. So that'll be something to look forward to.

Thursday, August 24, 2017

Schego Part 1, or Why Implementing Scheme in Go Isn't All That Crazy

So for anyone who has been following me on GitHub closely, this article is a week or two late - first off, I want to apologize for that. I've been busy getting back up to speed with college and such. But on to the main event.

So, you may be wondering, what exactly is a Schego? It's my current baby, my main pet project, the first on the list of bullet points that I'd point to when people ask about what I've done. It's an R7RS Scheme implementation written in the Go programming language, with some nifty technical features planned for it down the road.

I imagine the first thing you'd ask is: why implement yet another Scheme interpreter, let alone one in Go of all languages? The answer is pretty simple and self-centered - I've been planning on learning Scheme since at least 2015, and I've taken an interest in Go as of late. Implementing a language won't familiarize you with its 3rd-party libraries, but it will give you a reasonably in-depth look at the way the language's syntax is built, and its various ins and outs. Even for a minimalist language such as Scheme, building an interpreter is no small undertaking, so the project stands to give me some reasonable experience in Go as well. Additionally, most of the relevant techniques and materials required to undertake such a project are usually taught in senior-level classes, so it should look fairly good on a sophomore's resume, at least if I do it well.

As far as differentiating it from the other Scheme implementations out there, I have two main features in mind. The first is R7RS compatibility. R7RS is the newest version of the official Scheme specification, ratified only about 4 years ago as of this writing. Only 7 or 8 other Scheme interpreters currently implement the full R7RS spec to my knowledge, so in the Scheme world, that's reasonably rare air.

The second feature is a much more notable one, even though it arguably requires less work: Software transactional memory, or STM for short. It's a language implementation-level replacement for the lock-based multithreading paradigm that's usually implemented at the OS-level, unless we're talking about lightweight threads that are often cooperatively scheduled.  As the name might imply, it works very similarly to the transactions you might find operating in a database or at a bank: Critical stretches of code are marked as atomic, and are run without actually committing any of their memory writes to RAM. At the end of the block, the first atomic block to write to a shared space of memory "wins," and its changes are committed. The rest are "rolled back" and made to run again. There's a little bit more to it than that, but that's the gist of it, and what will eventually end up in Schego. Unless someone with more time and/or motivation sees this and decides to one-up me, I think I'll have the first ever STM-capable Scheme implementation, too. So that's nice.
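
To make the roll-back-and-retry idea concrete, here's a toy version-checked transaction in plain Go - strictly a sketch of the concept, and emphatically not Schego's eventual design:

package main

import (
 "fmt"
 "sync"
)

// TVar is a single transactional variable; version bumps on every commit.
type TVar struct {
 mu      sync.Mutex
 value   int64
 version uint64
}

// Atomically runs f against a snapshot and only commits if no other
// transaction has committed in the meantime; otherwise it rolls back
// (by simply discarding f's result) and retries.
func Atomically(tv *TVar, f func(old int64) int64) {
 for {
  tv.mu.Lock()
  oldVal, oldVer := tv.value, tv.version
  tv.mu.Unlock()

  newVal := f(oldVal) // run the "atomic block" without committing

  tv.mu.Lock()
  if tv.version == oldVer { // did we "win"?
   tv.value = newVal
   tv.version++
   tv.mu.Unlock()
   return
  }
  tv.mu.Unlock() // lost the race - retry with fresh state
 }
}

func main() {
 counter := &TVar{}
 var wg sync.WaitGroup
 for i := 0; i < 100; i++ {
  wg.Add(1)
  go func() {
   defer wg.Done()
   Atomically(counter, func(old int64) int64 { return old + 1 })
  }()
 }
 wg.Wait()
 fmt.Println(counter.value) // always 100, despite the contention
}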

On top of doing all this, I've been planning to write a blog post every once in a while detailing the design decisions I've made in Schego, both as a documentation/teaching tool and as a justification method - if I can't defend it in writing, it needs to be reworked. At the time of this post, the only major Schego subsystem that's currently built is the lexer, and it's almost (but not quite) finished. For the uninitiated, a lexer is a system that takes in a stream of characters and turns it into a list of tokens, which are simply small bits of data that other code later on can use. For instance, a function or variable name would be a token, a mathematical operator would be a token, and so on and so forth. For the code that needs to figure out how to actually run our original Scheme program, it's much easier to think in terms of tokens than raw characters. So let's get on with that lexer, shall we?

I think usability is important. It's better to have software that's easy to use for the end-user and tough to implement than the other way around. Even if your end-user is another programmer - say you're developing a library - or even if your end-user is yourself, usability still matters. There's no easier way to get good library-level usability than to design an interface you'd want to use yourself, which in turn is done by writing the unit tests for your code before any of the underlying code. Test-driven development has some other advantages (such as having reasonable code coverage, if you genuinely do write all your tests before writing the code that makes them succeed), but I'm mostly concerned with this one facet. So how would I want a lexer to give me tokens back?

// test lexing a single s-expression
func TestLexSingleExp(t *testing.T) {
 tokens := LexExp("(abc def ghi)")
 expectedTokens := []*Token{
  NewTokenString(TokenLParen, "("),
  NewTokenString(TokenIdent, "abc"),
  NewTokenString(TokenIdent, "def"),
  NewTokenString(TokenIdent, "ghi"),
  NewTokenString(TokenRParen, ")")}
 checkTokens(tokens, expectedTokens, t)
}

This is what I came up with - the first unit test I wrote for Schego. It's really very simplistic - a LexExp function that takes a string to be lexed for tokens, and then returns a list of Token pointers containing their type and value. A single Token looks like this, by the way:

type TokenType int

const (
 TokenNone TokenType = iota
 TokenRParen
 TokenLParen
 TokenIdent
 TokenIntLiteral
 TokenFloatLiteral
 TokenDot
 TokenOp
 TokenChar
)

type Token struct {
 Type  TokenType
 Value bytes.Buffer
}

Pretty straightforward, right? I chose a bytebuffer to hold the value for the token since, to the best of my knowledge, there is no easier method for storing a generic bunch of bytes of uncertain value in Go. It could be my inexperience with the language showing, but this is the best method I've found.
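
By the way, the test above uses NewTokenString, and LexExp below uses NewTokenRaw; I haven't shown those, but inferring from their call sites, they'd look something like this (my reconstruction, not the exact source):

func NewTokenString(tokenType TokenType, value string) *Token {
 token := &Token{Type: tokenType}
 token.Value.WriteString(value)
 return token
}

func NewTokenRaw(tokenType TokenType, value bytes.Buffer) *Token {
 return &Token{Type: tokenType, Value: value}
}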

So how does LexExp work?

// LexExp lexes an input string into Token objects. There are no possible user-facing
// errors from this process.
func LexExp(input string) []*Token {
 var tokens []*Token
 // accumulation variables for multi-character tokens such as idents and literals
 accumulating := false
 var accumulatingType TokenType
 var accumulatorBuffer bytes.Buffer
 for index, glyphRune := range input {
  glyph := string(glyphRune)
  if glyph == "(" {
   if accumulating == true {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
    accumulating = false
   }
   tokens = append(tokens, NewTokenString(TokenLParen, glyph))
   // rparen
  } else if glyph == ")" {
   if accumulating == true {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
    accumulating = false
   }
   tokens = append(tokens, NewTokenString(TokenRParen, glyph))
   // ident
  } else if unicode.IsLetter(glyphRune) {
   // were we building a number literal beforehand?
   if accumulating == true && accumulatingType != TokenIdent {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
   }
   accumulating = true
   accumulatingType = TokenIdent
   accumulatorBuffer.WriteString(glyph)
   // we're not sure what this character is, let the parser deal with it
  } else {
   tokens = append(tokens, NewTokenString(TokenChar, glyph))
  }
 }
 // corner case if the input string ends while we're still accumulating
 // should never happen in proper Scheme, but still...
 if accumulating == true {
  tokens = append(tokens, NewTokenRaw(accumulatingType, accumulatorBuffer))
  accumulatorBuffer.Reset()
  accumulating = false
 }
 return tokens
}

If this isn't completely valid Go code, please forgive me. This is a rough recreation of what the original function looked like after implementing the first test case. Let's go over its internals all the same, though. You'll notice that an extended if/else is used instead of a switch block. Especially for those of you who have read Lua's source code (which is a delightful read, by the way - it's a very clear and well-constructed codebase), you might be wondering why I went with an if/else and not a switch. The main reason is that R7RS is a little more complicated to lex - not parse! - than Lua, in my opinion. There's stuff like Unicode identifiers that really cannot be easily handled in a switch statement. I will admit that, similar to how Lua does things, a switch/case block with a default case taking care of the situations the switch couldn't would be worth a look; it's very possible I'll change LexExp over at a later date.

For multi-character tokens - identifiers, numerical/string literals, and so on - we have an accumulator buffer. Essentially, if we recognize that we've started lexing what could end up being a multi-character token, we write the character into the accumulator buffer instead of creating a token immediately. We then go on about our merry way accumulating (hence the name) characters in the accumulator buffer up until we find a character that cannot belong to a multi-character token, like a parenthesis; when that happens, we "flush" the accumulator, adding a new token to the token list with the accumulator buffer's contents as its value, before continuing to lex our not-multi-char token.
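
Tracing the code above on the input "(ab)" makes this concrete:

'(' -> not accumulating, emit TokenLParen immediately
'a' -> start accumulating a TokenIdent, buffer = "a"
'b' -> keep accumulating, buffer = "ab"
')' -> flush the accumulator (emit TokenIdent "ab"), then emit TokenRParen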

Lua does things a little differently. If it encounters a character that's (potentially) the start of a multi-character token, it lets each individual lexing function take care of proceeding further on down the line until it finds a suitable place to stop. For example, this is how Lua 5.3 handles numbers:

/* LUA_NUMBER */
/*
** this function is quite liberal in what it accepts, as 'luaO_str2num'
** will reject ill-formed numerals.
*/
static int read_numeral (LexState *ls, SemInfo *seminfo) {
  TValue obj;
  const char *expo = "Ee";
  int first = ls->current;
  lua_assert(lisdigit(ls->current));
  save_and_next(ls);
  if (first == '0' && check_next2(ls, "xX"))  /* hexadecimal? */
    expo = "Pp";
  for (;;) {
    if (check_next2(ls, expo))  /* exponent part? */
      check_next2(ls, "-+");  /* optional exponent sign */
    if (lisxdigit(ls->current))
      save_and_next(ls);
    else if (ls->current == '.')
      save_and_next(ls);
    else break;
  }
  save(ls, '\0');
  if (luaO_str2num(luaZ_buffer(ls->buff), &obj) == 0)  /* format error? */
    lexerror(ls, "malformed number", TK_FLT);
  if (ttisinteger(&obj)) {
    seminfo->i = ivalue(&obj);
    return TK_INT;
  }
  else {
    lua_assert(ttisfloat(&obj));
    seminfo->r = fltvalue(&obj);
    return TK_FLT;
  }
}
See the loop there? For comparison, this is what Schego's number-lexing code looks like, which admittedly doesn't handle things like exponents:

} else if glyph == "." {
 // . is a valid character in an ident - add it to the accumulator
 // if we were building an ident
 if accumulating == true && accumulatingType == TokenIdent {
  accumulatorBuffer.WriteString(glyph)
  // we can't start an ident with . - are we building a floating point literal?
 } else if chr := peek(input, index); !unicode.IsSpace(chr) && unicode.IsNumber(chr) {
  accumulating = true
  accumulatingType = TokenFloatLiteral
  accumulatorBuffer.WriteString(glyph)
  // there's situations where a standalone . is valid
 } else {
  tokens = append(tokens, NewTokenString(TokenDot, glyph))
 }
 // number literal
} else if unicode.IsNumber(glyphRune) {
 if accumulating == true && accumulatingType == TokenIdent {
  flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
 }
 accumulating = true
 // only declare that we are accumulating an int if we didn't see a . already
 if accumulatingType != TokenFloatLiteral {
  accumulatingType = TokenIntLiteral
 }
 accumulatorBuffer.WriteString(glyph)
 // we're not sure what this character is, let the parser deal with it
}

Something you'll notice comparing the two - or something you might've figured out from reading the leading comment on LexExp - is that I intend on not having any errors occur during the lexing process. I think malformed code is easier to catch during parsing, and that the resulting error messages are still informative enough that the culprit has enough info to correct their mistake. Is this actually the case? This is simply me theorycrafting, so we'll have to wait and see.

But getting back to lexing numbers: personally, I feel my method leads to less overall boilerplate, though you do have to check the accumulator essentially every time you grab a new token. You also need functions to manipulate the accumulator, which look like this:

// bufferStringToNum takes an input buffer and converts it from a string of
// character bytes to a float/int
func bufferStringToNum(tokenType TokenType, inputBuffer bytes.Buffer) *bytes.Buffer {
 bufferString := inputBuffer.String()
 var byteBuffer [binary.MaxVarintLen64]byte
 if tokenType == TokenFloatLiteral {
  num, _ := strconv.ParseFloat(bufferString, 64)
  binary.LittleEndian.PutUint64(byteBuffer[:], math.Float64bits(num))
 } else {
  num, _ := strconv.ParseInt(bufferString, 10, 64)
  binary.PutVarint(byteBuffer[:], num)
 }
 returnBuffer := bytes.NewBuffer(byteBuffer[:])
 return returnBuffer
}

// flushAccumulator empties the contents of the given Buffer into a new Token
// and resets it and the accumulator token type. A convenience function for LexExp.
func flushAccumulator(
 accumulatorType *TokenType,
 accumulatorBuffer *bytes.Buffer,
 tokenBuffer *[]*Token) {
 if *accumulatorType == TokenFloatLiteral || *accumulatorType == TokenIntLiteral {
  convertedBuffer := bufferStringToNum(*accumulatorType, *accumulatorBuffer)
  *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *convertedBuffer))
 } else {
  *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *accumulatorBuffer))
 }
 accumulatorBuffer.Reset()
 *accumulatorType = TokenNone
}

Not that bad though, right? Most of this code only needs to be written once; the checks for the accumulator are the only stretches that have to be done repeatedly throughout LexExp.

That's all for this first installment. The actual lexer is a bit longer, to handle things like operators, but the basic accumulator-based framework is still there. If you want to see what the lexer currently looks like, the code is up on GitHub, along with the complete lexer test suite. Next time we'll be taking a look at parsing and a little bit of parser theory!