Thursday, August 24, 2017

Schego Part 1, or Why Implementing Scheme in Go Isn't All That Crazy

For anyone who has been following me on GitHub closely, this article is a week or two late - I apologize for that. I've been busy getting back up to speed with college and such. But on to the main event.

So, you may be wondering, what exactly is a Schego? It's my current baby, my main pet project, the first on the list of bullet points that I'd point to when people ask about what I've done. It's an R7RS Scheme implementation written in the Go programming language, with some nifty technical features planned for it down the road.

I imagine the first thing you'd ask is: why implement yet another Scheme interpreter, let alone one in Go of all languages? The answer is pretty simple and self-centered - I've been planning on learning Scheme since at least 2015, and I've taken an interest in Go lately. Implementing a language won't familiarize you with its third-party libraries, but it will give you a reasonably in-depth look at how the language's syntax is built and its various ins and outs. Even for a minimalist language such as Scheme, building an interpreter is no small undertaking, so the project stands to give me some solid experience in Go as well. Additionally, most of the techniques and material required to undertake such a project are usually covered in senior-level classes, so it should look fairly good on a sophomore's resume - at least if I do it well.

As for how to differentiate it from the other Scheme implementations out there, I have two main features in mind. The first is R7RS compatibility. R7RS is the newest version of the official Scheme specification, ratified only about four years ago as of this writing. To my knowledge, only around 7 or 8 other Scheme interpreters currently implement the full R7RS spec, so in the Scheme world, that's reasonably rare air.

The second feature is a much more notable one, even though it arguably requires less work: software transactional memory, or STM for short. STM is a language-level replacement for the lock-based multithreading paradigm that's usually provided at the OS level (leaving aside lightweight threads, which are often cooperatively scheduled). As the name might imply, it works very similarly to the transactions you might find in a database or at a bank: critical stretches of code are marked as atomic and run without actually committing any of their memory writes. At the end of the block, the first atomic block to write to a shared piece of memory "wins," and its changes are committed; the rest are "rolled back" and made to run again. There's a little more to it than that, but that's the gist of it, and it's what will eventually end up in Schego. Unless someone with more time and/or motivation sees this and decides to one-up me, I think I'll have the first ever STM-capable Scheme implementation, too. So that's nice.
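To make the idea a little more concrete, here's a toy sketch of that optimistic commit-or-retry loop in plain Go. To be clear, this is not Schego code and says nothing about how the real thing will be built - the tvar cell and the atomically helper are made up purely to illustrate the concept:

package main

import (
    "fmt"
    "sync"
)

// tvar is a hypothetical transactional cell: a value plus a version number
// that gets bumped every time a transaction successfully commits.
type tvar struct {
    mu      sync.Mutex
    version uint64
    value   int
}

// atomically runs fn optimistically: it snapshots the cell, computes a new value
// without touching shared memory, and only commits if nobody else committed in
// the meantime. Losers are "rolled back" (their result is simply discarded) and retried.
func atomically(cell *tvar, fn func(old int) int) {
    for {
        cell.mu.Lock()
        startVersion := cell.version
        snapshot := cell.value
        cell.mu.Unlock()

        newValue := fn(snapshot) // runs without writing anything shared

        cell.mu.Lock()
        if cell.version == startVersion {
            cell.value = newValue
            cell.version++
            cell.mu.Unlock()
            return // this transaction "won"
        }
        cell.mu.Unlock() // someone else won - retry with a fresh snapshot
    }
}

func main() {
    counter := &tvar{}
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomically(counter, func(old int) int { return old + 1 })
        }()
    }
    wg.Wait()
    fmt.Println(counter.value) // always 100, despite the concurrent increments
}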

On top of all this, I plan on writing a blog post every once in a while detailing the design decisions I've made in Schego, both as a documentation/teaching tool and as a way to justify those decisions - if I can't defend a decision in writing, it needs to be reworked. As of this post, the only major Schego subsystem that's been built is the lexer, and it's almost (but not quite) finished. For the uninitiated, a lexer is a system that takes in a stream of characters and turns it into a list of tokens, which are simply small bits of data that other code later on can use. For instance, a function or variable name would be a token, a mathematical operator would be a token, and so on and so forth. For the code that needs to figure out how to actually run our original Scheme program, it's much easier to think in terms of tokens than raw characters. So let's get on with that lexer, shall we?

I think usability is important. It's better to have software that's easy for the end user and tough to implement than the other way around. Even if your end user is another programmer - if you're developing a library, for instance - or even if your end user is yourself, usability still matters. There's no easier way to get good library-level usability than to write an interface you'd want to use yourself, which in turn comes from writing your unit tests before any of the underlying code exists. Test-driven development has other advantages (such as reasonable code coverage, if you genuinely do write all your tests before writing the code that makes them pass), but I'm mostly concerned with this one facet. So how would I want a lexer to give me tokens back?
// test lexing a single s-expression
func TestLexSingleExp(t *testing.T) {
    tokens := LexExp("(abc def ghi)")
    expectedTokens := []*Token{
        NewTokenString(TokenLParen, "("),
        NewTokenString(TokenIdent, "abc"),
        NewTokenString(TokenIdent, "def"),
        NewTokenString(TokenIdent, "ghi"),
        NewTokenString(TokenRParen, ")")}
    checkTokens(tokens, expectedTokens, t)
}
This is what I came up with - the first unit test I wrote for Schego. It's really very simple: a LexExp function that takes a string to be lexed for tokens and returns a list of Token pointers containing their type and value. A single Token looks like this, by the way:
type TokenType int

const (
    TokenNone TokenType = iota
    TokenRParen
    TokenLParen
    TokenIdent
    TokenIntLiteral
    TokenFloatLiteral
    TokenDot
    TokenOp
    TokenChar
)

type Token struct {
    Type  TokenType
    Value bytes.Buffer
}
Pretty straightforward, right? I chose a bytes.Buffer to hold the token's value since, to the best of my knowledge, it's the easiest way in Go to store a generic bunch of bytes of uncertain size and type. It could be my inexperience with the language showing, but this is the best method I've found.
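Two small helpers referenced so far aren't shown in this post: the checkTokens assertion the test relies on, and the NewTokenString/NewTokenRaw constructors (the latter also shows up in LexExp below). Here are minimal sketches of what they might look like - simplified versions rather than the exact code in the repo, assuming the standard bytes and testing packages are imported:

// checkTokens compares the lexed tokens against the expected ones and fails
// the test on the first mismatch.
func checkTokens(tokens []*Token, expectedTokens []*Token, t *testing.T) {
    if len(tokens) != len(expectedTokens) {
        t.Fatalf("expected %d tokens, got %d", len(expectedTokens), len(tokens))
    }
    for index, token := range tokens {
        expected := expectedTokens[index]
        if token.Type != expected.Type || !bytes.Equal(token.Value.Bytes(), expected.Value.Bytes()) {
            t.Errorf("token %d: expected %q (type %v), got %q (type %v)",
                index, expected.Value.String(), expected.Type, token.Value.String(), token.Type)
        }
    }
}

// NewTokenString builds a Token of the given type from a plain string value.
func NewTokenString(tokenType TokenType, value string) *Token {
    var buffer bytes.Buffer
    buffer.WriteString(value)
    return &Token{Type: tokenType, Value: buffer}
}

// NewTokenRaw builds a Token directly from an existing byte buffer,
// such as the accumulator buffer used in LexExp.
func NewTokenRaw(tokenType TokenType, buffer bytes.Buffer) *Token {
    return &Token{Type: tokenType, Value: buffer}
}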
So how does LexExp work?
// LexExp lexes an input string into Token objects. There are no possible user-facing
// errors from this process.
func LexExp(input string) []*Token {
    var tokens []*Token
    // accumulation variables for multi-character tokens such as idents and literals
    accumulating := false
    var accumulatingType TokenType
    var accumulatorBuffer bytes.Buffer
    // index goes unused here, but the full lexer needs it to peek ahead at the next rune
    for index, glyphRune := range input {
        glyph := string(glyphRune)
        if glyph == "(" {
            if accumulating == true {
                flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
                accumulating = false
            }
            tokens = append(tokens, NewTokenString(TokenLParen, glyph))
            // rparen
        } else if glyph == ")" {
            if accumulating == true {
                flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
                accumulating = false
            }
            tokens = append(tokens, NewTokenString(TokenRParen, glyph))
            // ident
        } else if unicode.IsLetter(glyphRune) {
            // were we building a number literal beforehand?
            if accumulating == true && accumulatingType != TokenIdent {
                flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
            }
            accumulating = true
            accumulatingType = TokenIdent
            accumulatorBuffer.WriteString(glyph)
            // we're not sure what this character is, let the parser deal with it
        } else {
            tokens = append(tokens, NewTokenString(TokenChar, glyph))
        }
    }
    // corner case if the input string ends while we're still accumulating -
    // should never happen in proper Scheme, but still...
    if accumulating == true {
        tokens = append(tokens, NewTokenRaw(accumulatingType, accumulatorBuffer))
        accumulatorBuffer.Reset()
        accumulating = false
    }
    return tokens
}
If this isn't completely valid Go code, please forgive me - it's a rough recreation of what the original function looked like after implementing the first test case. Let's go over its internals all the same. You'll notice that an extended if/else is used instead of a switch block. Those of you who have read Lua's source code (a delightful read, by the way - it's a very clear and well-constructed codebase) might be wondering why I went with an if/else and not a switch. The main reason is that R7RS is a little more complicated to lex - not parse! - than Lua, in my opinion. There's stuff like Unicode identifiers that can't easily be handled by switching on individual characters. I will admit that, similar to how Lua does things, a switch block with a default case to handle whatever the explicit cases don't would be worth a look, though; it's very possible I'll change LexExp over to that style at a later date.
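For reference, that alternative would look roughly like the following - an expressionless switch whose default case plays the role of the trailing else. This is only a toy classifier to show the shape; it leaves out the accumulator logic entirely and isn't Schego code:

// classify maps a single rune to a token type, switch-style. The default case
// catches anything the explicit cases don't handle.
func classify(glyphRune rune) TokenType {
    switch glyph := string(glyphRune); {
    case glyph == "(":
        return TokenLParen
    case glyph == ")":
        return TokenRParen
    case unicode.IsLetter(glyphRune):
        return TokenIdent
    case unicode.IsNumber(glyphRune):
        return TokenIntLiteral
    default:
        // anything unrecognized gets punted to the parser as a TokenChar
        return TokenChar
    }
}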

For multi-character tokens - identifiers, numerical/string literals, and so on - we have an accumulator buffer. Essentially, if we recognize that we've started lexing what could end up being a multi-character token, we write the character into the accumulator buffer instead of creating a token immediately. We then go on about our merry way accumulating (hence the name) characters into the buffer until we find a character that can't belong to a multi-character token, like a parenthesis. When that happens, we "flush" the accumulator - adding a new token to the token list with the accumulator buffer's contents as its value - before continuing to lex the non-multi-character token we just hit. Lexing "abc)", for example, piles a, b, and c into the accumulator, and the closing parenthesis flushes them out as a single identifier token before the right-paren token itself is emitted.
Lua does things a little differently. If it encounters a character that's (potentially) the start of a multi-character token, it lets each individual lexing function take care of proceeding further on down the line until it finds a suitable place to stop. For example, this is how Lua 5.3 handles numbers:
/* LUA_NUMBER */
/*
** this function is quite liberal in what it accepts, as 'luaO_str2num'
** will reject ill-formed numerals.
*/
static int read_numeral (LexState *ls, SemInfo *seminfo) {
  TValue obj;
  const char *expo = "Ee";
  int first = ls->current;
  lua_assert(lisdigit(ls->current));
  save_and_next(ls);
  if (first == '0' && check_next2(ls, "xX"))  /* hexadecimal? */
    expo = "Pp";
  for (;;) {
    if (check_next2(ls, expo))  /* exponent part? */
      check_next2(ls, "-+");  /* optional exponent sign */
    if (lisxdigit(ls->current))
      save_and_next(ls);
    else if (ls->current == '.')
      save_and_next(ls);
    else break;
  }
  save(ls, '\0');
  if (luaO_str2num(luaZ_buffer(ls->buff), &obj) == 0)  /* format error? */
    lexerror(ls, "malformed number", TK_FLT);
  if (ttisinteger(&obj)) {
    seminfo->i = ivalue(&obj);
    return TK_INT;
  }
  else {
    lua_assert(ttisfloat(&obj));
    seminfo->r = fltvalue(&obj);
    return TK_FLT;
  }
}
See the loop there? For comparison, this is what Schego's number-lexing code looks like, which admittedly doesn't handle things like exponents:
        } else if glyph == "." {
            // . is a valid character in an ident - add it to the accumulator
            // if we were building an ident
            if accumulating == true && accumulatingType == TokenIdent {
                accumulatorBuffer.WriteString(glyph)
                // we can't start an ident with . - are we building a floating point literal?
            } else if chr := peek(input, index); !unicode.IsSpace(chr) && unicode.IsNumber(chr) {
                accumulating = true
                accumulatingType = TokenFloatLiteral
                accumulatorBuffer.WriteString(glyph)
                // there's situations where a standalone . is valid
            } else {
                tokens = append(tokens, NewTokenString(TokenDot, glyph))
            }
            // number literal
        } else if unicode.IsNumber(glyphRune) {
            if accumulating == true && accumulatingType == TokenIdent {
                flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
            }
            accumulating = true
            // only declare that we are accumulating an int if we didn't see a . already
            if accumulatingType != TokenFloatLiteral {
                accumulatingType = TokenIntLiteral
            }
            accumulatorBuffer.WriteString(glyph)
            // we're not sure what this character is, let the parser deal with it
        }
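That snippet leans on a peek helper that isn't shown here. Something along these lines does the job - a simplified sketch rather than the exact code, assuming the current glyph is a single byte (true for ".") and treating end-of-input as whitespace (it needs the unicode/utf8 package):

// peek returns the rune immediately following the glyph at the given byte index,
// or a space if we've already hit the end of the input.
func peek(input string, index int) rune {
    if index+1 >= len(input) {
        return ' '
    }
    next, _ := utf8.DecodeRuneInString(input[index+1:])
    return next
}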
Something you'll notice comparing the two - or something you might've figured out from reading the leading comment on LexExp - is that I don't intend to have any errors occur during the lexing process. I think malformed code is easier to catch during parsing, and that the resulting error messages will still be informative enough for the culprit to correct their mistake. Is this actually the case? This is simply me theorycrafting, so we'll have to wait and see.
But getting back to lexing numbers: personally, I feel my method leads to less overall boilerplate, though you do have to check the accumulator essentially every time you grab a new token. You also need functions to manipulate the accumulator, which look like this:
// bufferStringToNum takes an input buffer and converts it from a string of
// character bytes to a float/int
func bufferStringToNum(tokenType TokenType, inputBuffer bytes.Buffer) *bytes.Buffer {
    bufferString := inputBuffer.String()
    var byteBuffer [binary.MaxVarintLen64]byte
    if tokenType == TokenFloatLiteral {
        num, _ := strconv.ParseFloat(bufferString, 64)
        binary.LittleEndian.PutUint64(byteBuffer[:], math.Float64bits(num))
    } else {
        num, _ := strconv.ParseInt(bufferString, 10, 64)
        binary.PutVarint(byteBuffer[:], num)
    }
    returnBuffer := bytes.NewBuffer(byteBuffer[:])
    return returnBuffer
}

// flushAccumulator empties the contents of the given Buffer into a new Token
// and resets it and the accumulator token type. A convenience function for LexExp.
func flushAccumulator(
    accumulatorType *TokenType,
    accumulatorBuffer *bytes.Buffer,
    tokenBuffer *[]*Token) {
    if *accumulatorType == TokenFloatLiteral || *accumulatorType == TokenIntLiteral {
        convertedBuffer := bufferStringToNum(*accumulatorType, *accumulatorBuffer)
        *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *convertedBuffer))
    } else {
        *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *accumulatorBuffer))
    }
    accumulatorBuffer.Reset()
    *accumulatorType = TokenNone
}
Not that bad, though, right? Most of this code only needs to be written once; the accumulator checks are the only stretches that have to be repeated throughout LexExp.
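As a sanity check that nothing gets lost in all that byte-shuffling, here's roughly how a later stage (or a unit test) could pull the numbers back out of a token's value buffer, mirroring the encoding above. These two helpers are just for illustration - they aren't part of the current lexer:

// tokenToFloat decodes a TokenFloatLiteral value: the first 8 bytes of the
// buffer are the little-endian IEEE 754 bit pattern written by bufferStringToNum.
func tokenToFloat(token *Token) float64 {
    bits := binary.LittleEndian.Uint64(token.Value.Bytes())
    return math.Float64frombits(bits)
}

// tokenToInt decodes a TokenIntLiteral value from its varint encoding;
// any trailing padding bytes in the buffer are ignored.
func tokenToInt(token *Token) int64 {
    num, _ := binary.Varint(token.Value.Bytes())
    return num
}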

That's all for this first installment. The actual lexer is a bit longer, since it handles things like operators, but the basic accumulator-based framework is still there. If you want to look further, the code is up on GitHub, along with the complete lexer test suite. Next time we'll be taking a look at parsing and a little bit of parser theory!