Sunday, October 1, 2017

Schego Part 2, Or On Parsing and Other Related Stuff

Alright, part 2. And only a couple weeks later than planned to boot!

Last time, we discussed the basics of lexing - turning a string into a list of tokens - and the ins and outs of how that's done. So now, we'll move onto parsing, which is the next step in the process of turning our original source code into something the computer can understand.
The rationale for parsing is very simple, and mirrors the rationale for lexing: it's a lot easier to reason about generating code if you're working at the statement/expression level instead of at the token level. Think about it: if you were trying to write a program from a description of what that program is supposed to do, wouldn't you rather the description say "put an if statement here" instead of "stick a semicolon here and a number there"?
Additionally - and this is probably at least two future blog posts away - it's a lot easier to perform optimizations when thinking at the statement level as well. If you have a general idea of what the original code is attempting to do, you can potentially replace it with something faster. We'll need other data structures besides what our parsing process implements, but we'll cross that bridge when we get to it.
So, how is Schego's parser built? At a high level, much the same way as any parser: We represent the program's structure with a tree, known as an AST (Abstract Syntax Tree), with each individual node adhering to this interface:

// base interface for functions needing to accept any kind of AST node
type AstNode interface {
 GetSubNodes() []AstNode
 AddSubNode(AstNode)
 GetType() AstNodeType
 DebugString() string
}

If you recall the first article, I didn't create any interfaces there, so why the change? There will be functions - as we'll see in a moment - that should ideally work with or return any potential combination of AST nodes; when you're parsing a language where literally everything evaluates to some kind of expression, you need flexibility without repeating yourself. Since Go doesn't have partially abstract classes, I had to split the definition and the implementation into two separate places: the definition lives in AstNode, and the baseline implementation in another struct, SExp - short for S-expression, the structure LISP/Scheme is built on:

// base struct that all AST node implementations build off of
type SExp struct {
 subNodes []AstNode
}

func (s *SExp) GetSubNodes() []AstNode {
 return s.subNodes
}
func (s *SExp) AddSubNode(node AstNode) {
 s.subNodes = append(s.subNodes, node)
}

Still pretty straightforward, right? Specific nodes implement GetType() and DebugString() to return their proper values, and the basic stuff that should only be written once gets written once. By the way, GetType() returns a value from an enum describing what sort of node the struct is. This isn't normally something we'll need, but for certain optimizations - including the tail-call optimization required of all Scheme implementations - it'll be handy to check whether some node has sub-nodes of a certain type.
For the actual implementation of one of the nodes, here's the current implementation of if expressions:

type IfExp struct {
 SExp
}

func NewIfExp(cond AstNode, onTrue AstNode, onFalse AstNode) *IfExp {
 node := new(IfExp)
 node.AddSubNode(cond)
 node.AddSubNode(onTrue)
 node.AddSubNode(onFalse)
 return node
}
func (i IfExp) GetType() AstNodeType {
 return IfNode
}
func (i IfExp) DebugString() string {
 return "IfExp(" + i.subNodes[0].DebugString() + ", " + i.subNodes[1].DebugString() + ", " + i.subNodes[2].DebugString() + ")"
}

Nothing too out of the ordinary here. But what about literals?

type StringLiteral struct {
 SExp
 Value string
}

func NewStringLiteral(value string) *StringLiteral {
 node := new(StringLiteral)
 node.Value = value
 return node
}
func (s StringLiteral) GetType() AstNodeType {
 return StringNode
}
func (s StringLiteral) DebugString() string {
 return "\"" + s.Value + "\""
}

You might think that there's really no need for a string literal - or any other sort of literal, really - to be able to contain sub-nodes, an ability brought in by the SExp struct. And you'd be right; this is something of a leaky abstraction. For now, I think it's fine. Any code that messes with the AST after the fact should be calling GetType() in the first place, and due to the way parsing works in Schego, nothing except post-AST-construction code will modify a literal besides the exact same code that creates it from the original tokens. I'll also admit that I don't know for sure whether this is an idiomatic Go solution to the problem, but it seemed fairly intuitive to me. If there's something more elegant out there, do let me know - I'm all in favor of doing no more work than necessary!
On a side note, structuring our data like this lets us write our tests fairly easily. All we have to do is compare the AST our parser made with one made ourselves in the test, like this:

func TestParseSingleExp(t *testing.T) {
 tokens := LexExp("(+ 5 3)")
 program := ParseTokens(tokens)
 expectedProgram := NewProgram(NewAddExp(NewIntLiteral(5), NewIntLiteral(3)))
 checkProgram(program, expectedProgram, t)
}

But this article was about parsing, right? How do we actually do the parsing bit? We've covered the data structures behind the process, but not the parsing algorithm itself. Schego uses what is known as a "recursive descent" parser: essentially, we define our "grammar" - the set of rules that describes our language - recursively. Take the if expression, for instance. We might say it's defined by the word "if," followed by a conditional expression, followed by two more expressions that will be evaluated if the conditional evaluates to true or false.
What do we define as an expression? It could be a lot of different things - a literal, a math formula, a function, another if expression (!), lots of stuff. What's important is that when we're constructing an if expression, we ask the parser to run its general expression-parsing function for each of the if's expressions. As long as we handle the base cases (literals/identifiers) properly, we can throw all sorts of wonky expressions at our if-parsing routine and it'll handle them just fine, thanks to the recursion.
Additionally, the resultant code is fairly reusable inside the parser itself. If we define a function to parse, say, a list literal, we can simply call that function any time we expect a list literal, and it handles all the nitty-gritty details of validation and list-literal-specific parsing for us. Pretty much a win-win no matter how you look at it. But enough justification. What does Schego's parsing code look like?

// ParseTokens takes tokens and returns an AST (Abstract Syntax Tree) representation
func ParseTokens(tokens []*Token) *Program {
 program := NewProgram()
 currentIndex := 0
 for len(tokens)-1 >= currentIndex {
  node, _ := parseExpression(tokens, &currentIndex)
  program.AddSubNode(node)
 }
 return program
}

// accept checks to see if the current token matches a given token type, and advances if so
func accept(tokens []*Token, expectedType TokenType, currentIndex *int) bool {
 if tokens[*currentIndex].Type == expectedType {
  *currentIndex++
  return true
 }
 return false
}

// grabAccepted returns the token just before current, useful for grabbing the value of an accepted token
func grabAccepted(tokens []*Token, currentIndex *int) *Token {
 return tokens[*currentIndex-1]
}

// expect returns an error if the current token doesn't match the given type
func expect(tokens []*Token, expectedType TokenType, currentIndex *int) error {
 if len(tokens)-1 < *currentIndex {
  return errors.New("Unexpected EOF")
 } else if tokens[*currentIndex].Type != expectedType {
  return errors.New("Unexpected token")
 }
 return nil
}

func parseExpression(tokens []*Token, currentIndex *int) (AstNode, error) {
 // try literals first
 if accept(tokens, TokenIntLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewIntLiteral(bufferToInt(literal.Value)), nil
 } else if accept(tokens, TokenFloatLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewFloatLiteral(bufferToFloat(literal.Value)), nil
 } else if accept(tokens, TokenStringLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewStringLiteral(literal.Value.String()), nil
 } else if accept(tokens, TokenBoolLiteral, currentIndex) {
  literal := grabAccepted(tokens, currentIndex)
  return NewBoolLiteral(literal.Value.Bytes()[0] == 1), nil
 }
 // not a literal, attempt to parse an expression
 lparenError := expect(tokens, TokenLParen, currentIndex)
 if lparenError != nil {
  return nil, lparenError
 }
 // jump past the lparen
 *currentIndex++
 if accept(tokens, TokenOp, currentIndex) {
  // operator-parsing code here
 }
 if accept(tokens, TokenIdent, currentIndex) {
  identToken := grabAccepted(tokens, currentIndex)
  switch identToken.Value.String() {
  case "if":
   // TODO: error-handling here (and throughout the parser!)
   cond, _ := parseExpression(tokens, currentIndex)
   ifTrue, _ := parseExpression(tokens, currentIndex)
   ifFalse, _ := parseExpression(tokens, currentIndex)
   ifNode := NewIfExp(cond, ifTrue, ifFalse)
   expError := closeExp(tokens, currentIndex)
   if expError != nil {
    return nil, expError
   }
   return ifNode, nil
  }
 }
 // no matches?
 return nil, errors.New("Unexpected token")
}

// convenience function to ensure an expression is properly closed
func closeExp(tokens []*Token, currentIndex *int) error {
 rparenError := expect(tokens, TokenRParen, currentIndex)
 if rparenError != nil {
  return rparenError
 }
 *currentIndex += 1
 return nil
}

func bufferToInt(buffer bytes.Buffer) int64 {
 num, _ := binary.Varint(buffer.Bytes())
 return num
}
func bufferToFloat(buffer bytes.Buffer) float64 {
 bits := binary.LittleEndian.Uint64(buffer.Bytes())
 return math.Float64frombits(bits)
}

I've gutted some case-specific stuff like the details of parsing operators in an attempt to get the point across as succinctly as possible, but the overarching algorithm is still there. You can see the recursive calls to parseExpression present in the if-parsing code, and the literal-parsing code that forms the base cases. This basic setup is enough to get this tricky-looking test to pass:

func TestIfExp(t *testing.T) {
 tokens := LexExp("(if (> 6 5) \"true\" \"false\")")
 program := ParseTokens(tokens)
 expectedProgram := NewProgram(NewIfExp(NewGtExp(NewIntLiteral(6), NewIntLiteral(5)), NewStringLiteral("true"), NewStringLiteral("false")))
 checkProgram(program, expectedProgram, t)
}

All the if-parsing code knows is that an if node takes three expressions. It doesn't know anything about what constitutes a valid expression, or how to parse one; it leaves that up to the parseExpression() function, and it all Just Works.

As always, the code is up on GitHub if you want to take a look. We'll be looking at stack-based virtual machines next time, so we can actually do something with the AST we've generated. As I alluded to earlier, at some point in the (near? distant?) future we'll also take a look at optimization, which is something I haven't really found in any similar how-to-make-a-language series. So that'll be something to look forward to.

Thursday, August 24, 2017

Schego Part 1, or Why Implementing Scheme in Go Isn't All That Crazy

So for anyone who has been following me on GitHub closely, this article is a week or two late - I want to apologize for that first of all. I've been busy getting back up to speed with college and such. But on to the main event.

So, you may be wondering, what exactly is a Schego? It's my current baby, my main pet project, the first bullet point I'd reach for when people ask about what I've done. It's an R7RS Scheme implementation written in the Go programming language, with some nifty technical features planned for it down the road.

I imagine the first thing you'd ask is: why implement yet another Scheme interpreter, let alone one in Go of all languages? The answer is pretty simple and self-centered - I've been planning on learning Scheme since at least 2015, and I've taken an interest in Go as of late. Implementing a language won't familiarize you with its third-party libraries, but it will give you a reasonably in-depth look at how the language's syntax is built, and its various ins and outs. Even for a minimalist language such as Scheme, building an interpreter is no small undertaking, so the project stands to give me some reasonable experience in Go as well. Additionally, most of the relevant techniques and materials are usually taught in senior-level classes, so the project would look fairly good on a sophomore's resume - at least if I did it well.

As far as differentiating it from the other Scheme implementations out there, I have two main features in mind. The first is R7RS compatibility. R7RS is the newest version of the official Scheme specification, ratified only about four years ago as of this writing. Only seven or eight other Scheme interpreters currently implement the full R7RS spec to my knowledge, so in the Scheme world, that's reasonably rare air.

The second feature is a much more notable one, even though it arguably requires less work: software transactional memory, or STM for short. It's a language-implementation-level replacement for the lock-based multithreading paradigm that's usually implemented at the OS level, unless we're talking about lightweight threads that are often cooperatively scheduled. As the name might imply, it works very similarly to the transactions you might find operating in a database or at a bank: critical stretches of code are marked as atomic, and are run without actually committing any of their memory writes to RAM. At the end of the block, the first atomic block to write to a shared space of memory "wins," and its changes are committed. The rest are "rolled back" and made to run again. There's a little more to it than that, but that's the gist of it, and what will eventually end up in Schego. Unless someone with more time and/or motivation sees this and decides to one-up me, I think I'll have the first ever STM-capable Scheme implementation, too. So that's nice.

On top of doing all this, I've planned on writing a blog post every once in a while detailing the design decisions I've made in Schego, both as a documentation/teaching tool and as a justification method - if I can't defend it in writing, it needs to be reworked. At the time of this post, the only major Schego subsystem that's currently built is the lexer, and it's almost (but not quite) finished. For the uninitiated, a lexer is a system that takes in a stream of characters and turns it into a list of tokens, which are simply small bits of data that other code later on can use. For instance, a function or variable name would be a token, a mathematical operator would be a token, and so on and so forth. For the code that needs to figure out how to actually run our original Scheme program, it's much easier to think in terms of tokens than raw characters. So let's get on with that lexer, shall we?

I think usability is important. It's better to have software that's easy to use for the end-user but tough to implement than the other way around. Even if your end-user is another programmer - if you're developing a library, for instance - or even if your end-user is yourself, usability still matters. There's no easier way to achieve good library-level usability than to write an interface you'd want to use yourself, which in turn is done by writing the unit tests for your code before any underlying code has been written. Test-driven development has some other advantages (such as reasonable code coverage, if you genuinely do write all your tests before the code that makes them pass), but I'm mostly concerned with this one facet. So how would I want a lexer to give me tokens back?
// test lexing a single s-expression
func TestLexSingleExp(t *testing.T) {
 tokens := LexExp("(abc def ghi)")
 expectedTokens := []*Token{
  NewTokenString(TokenLParen, "("),
  NewTokenString(TokenIdent, "abc"),
  NewTokenString(TokenIdent, "def"),
  NewTokenString(TokenIdent, "ghi"),
  NewTokenString(TokenRParen, ")")}
 checkTokens(tokens, expectedTokens, t)
}
This is what I came up with - the first unit test I wrote for Schego. It's really very simplistic - a LexExp function that takes a string to be lexed for tokens, and then returns a list of Token pointers containing their type and value. A single Token looks like this, by the way:
type TokenType int

const (
 TokenNone TokenType = iota
 TokenRParen
 TokenLParen
 TokenIdent
 TokenIntLiteral
 TokenFloatLiteral
 TokenDot
 TokenOp
 TokenChar
)

type Token struct {
 Type  TokenType
 Value bytes.Buffer
}
Pretty straightforward, right? I chose a byte buffer to hold the token's value since, to the best of my knowledge, there is no easier method in Go for storing a generic bunch of bytes whose interpretation isn't known up front. It could be my inexperience with the language showing, but this is the best method I've found.
So how does LexExp work?
// LexExp lexes an input string into Token objects. There are no possible user-facing
// errors from this process.
func LexExp(input string) []*Token {
 var tokens []*Token
 // accumulation variables for multi-character tokens such as idents and literals
 accumulating := false
 var accumulatingType TokenType
 var accumulatorBuffer bytes.Buffer
 for index, glyphRune := range input {
  glyph := string(glyphRune)
  if glyph == "(" {
   if accumulating == true {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
    accumulating = false
   }
   tokens = append(tokens, NewTokenString(TokenLParen, glyph))
   // rparen
  } else if glyph == ")" {
   if accumulating == true {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
    accumulating = false
   }
   tokens = append(tokens, NewTokenString(TokenRParen, glyph))
   // ident
  } else if unicode.IsLetter(glyphRune) {
   // were we building a number literal beforehand?
   if accumulating == true && accumulatingType != TokenIdent {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
   }
   accumulating = true
   accumulatingType = TokenIdent
   accumulatorBuffer.WriteString(glyph)
   // we're not sure what this character is, let the parser deal with it
  } else {
   tokens = append(tokens, NewTokenString(TokenChar, glyph))
  }
 }
 // corner case: the input string ended while we were still accumulating
 // should never happen in proper Scheme, but still...
 if accumulating == true {
  tokens = append(tokens, NewTokenRaw(accumulatingType, accumulatorBuffer))
  accumulatorBuffer.Reset()
  accumulating = false
 }
 return tokens
}
If this isn't completely valid Go code, please forgive me - it's a rough recreation of what the original function looked like after implementing the first test case. Let's go over its internals all the same. You'll notice that an extended if/else chain is used instead of a switch block. Those of you who have read Lua's source code (a delightful read, by the way - it's a very clear and well-constructed codebase) might especially be wondering why I went with if/else and not switch. The main reason is that R7RS is, in my opinion, a little more complicated to lex - not parse! - than Lua. There's stuff like Unicode identifiers that really cannot be easily handled in a switch statement. I will admit that a switch block with a default case taking care of whatever the switch itself couldn't - similar to how Lua does things - would be worth a look, though; it's very possible I'll change LexExp over at a later date.

For multi-character tokens - identifiers, numerical/string literals, and so on - we have an accumulator buffer. Essentially, if we recognize that we've started lexing what could end up being a multi-character token, we write the character to the accumulator buffer instead of creating a token immediately. We then go on our merry way, accumulating (hence the name) characters in the buffer until we find a character that cannot belong to a multi-character token, like a parenthesis. When that happens, we "flush" the accumulator - adding a new token to the token list using the accumulator buffer's contents as its value - before continuing to lex the non-multi-character token we just found.
Lua does things a little differently. If it encounters a character that's (potentially) the start of a multi-character token, it lets each individual lexing function take care of proceeding further on down the line until it finds a suitable place to stop. For example, this is how Lua 5.3 handles numbers:
/* LUA_NUMBER */
/*
** this function is quite liberal in what it accepts, as 'luaO_str2num'
** will reject ill-formed numerals.
*/
static int read_numeral (LexState *ls, SemInfo *seminfo) {
  TValue obj;
  const char *expo = "Ee";
  int first = ls->current;
  lua_assert(lisdigit(ls->current));
  save_and_next(ls);
  if (first == '0' && check_next2(ls, "xX"))  /* hexadecimal? */
    expo = "Pp";
  for (;;) {
    if (check_next2(ls, expo))  /* exponent part? */
      check_next2(ls, "-+");  /* optional exponent sign */
    if (lisxdigit(ls->current))
      save_and_next(ls);
    else if (ls->current == '.')
      save_and_next(ls);
    else break;
  }
  save(ls, '\0');
  if (luaO_str2num(luaZ_buffer(ls->buff), &obj) == 0)  /* format error? */
    lexerror(ls, "malformed number", TK_FLT);
  if (ttisinteger(&obj)) {
    seminfo->i = ivalue(&obj);
    return TK_INT;
  }
  else {
    lua_assert(ttisfloat(&obj));
    seminfo->r = fltvalue(&obj);
    return TK_FLT;
  }
}
See the loop there? For comparison, this is what Schego's number-lexing code looks like, which admittedly doesn't handle things like exponents:
  } else if glyph == "." {
   // . is a valid character in an ident - add it to the accumulator
   // if we were building an ident
   if accumulating == true && accumulatingType == TokenIdent {
    accumulatorBuffer.WriteString(glyph)
    // we can't start an ident with . - are we building a floating point literal?
   } else if chr := peek(input, index); !unicode.IsSpace(chr) && unicode.IsNumber(chr) {
    accumulating = true
    accumulatingType = TokenFloatLiteral
    accumulatorBuffer.WriteString(glyph)
    // there's situations where a standalone . is valid
   } else {
    tokens = append(tokens, NewTokenString(TokenDot, glyph))
   }
   // number literal
  } else if unicode.IsNumber(glyphRune) {
   if accumulating == true && accumulatingType == TokenIdent {
    flushAccumulator(&accumulatingType, &accumulatorBuffer, &tokens)
   }
   accumulating = true
   // only declare that we are accumulating an int if we didn't see a . already
   if accumulatingType != TokenFloatLiteral {
    accumulatingType = TokenIntLiteral
   }
   accumulatorBuffer.WriteString(glyph)
   // we're not sure what this character is, let the parser deal with it
  }
Something you'll notice comparing the two - or something you might've figured out from reading the leading comment on LexExp - is that I intend on not having any errors occur during the lexing process. I think malformed code is easier to catch during parsing, and that the resulting error messages are still informative enough that the culprit has enough info to correct their mistake. Is this actually the case? This is simply me theorycrafting, so we'll have to wait and see.
But getting back to numbers and lexing numbers, personally I feel my method leads to less overall boilerplate, though you do have to check the accumulator essentially every time you grab a new token. You also need functions to manipulate the accumulator, which look like this:
// bufferStringToNum takes an input buffer and converts it from a string of
// character bytes to a float/int
func bufferStringToNum(tokenType TokenType, inputBuffer bytes.Buffer) *bytes.Buffer {
 bufferString := inputBuffer.String()
 var byteBuffer [binary.MaxVarintLen64]byte
 if tokenType == TokenFloatLiteral {
  num, _ := strconv.ParseFloat(bufferString, 64)
  binary.LittleEndian.PutUint64(byteBuffer[:], math.Float64bits(num))
 } else {
  num, _ := strconv.ParseInt(bufferString, 10, 64)
  binary.PutVarint(byteBuffer[:], num)
 }
 returnBuffer := bytes.NewBuffer(byteBuffer[:])
 return returnBuffer
}

// flushAccumulator empties the contents of the given Buffer into a new Token
// and resets it and the accumulator token type. A convenience function for LexExp.
func flushAccumulator(
 accumulatorType *TokenType,
 accumulatorBuffer *bytes.Buffer,
 tokenBuffer *[]*Token) {
 if *accumulatorType == TokenFloatLiteral || *accumulatorType == TokenIntLiteral {
  convertedBuffer := bufferStringToNum(*accumulatorType, *accumulatorBuffer)
  *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *convertedBuffer))
 } else {
  *tokenBuffer = append(*tokenBuffer, NewTokenRaw(*accumulatorType, *accumulatorBuffer))
 }
 accumulatorBuffer.Reset()
 *accumulatorType = TokenNone
}
Not that bad though, right? Most of this code will need to be written only the one time; the checks for the accumulator are the only stretches that have to be done repeatedly throughout LexExp.

That's all for this first installment. The actual lexer is a bit longer, to handle stuff like operators and other things, but the basic accumulator-based framework is still there. If you want to see what the lexer currently looks like, the code is up on GitHub, along with the complete lexer test suite. Next time we'll be taking a look at parsing and a little bit of parser theory!