Sunday, October 1, 2017

Schego Part 2, Or On Parsing and Other Related Stuff

Alright, part 2. And only a couple weeks later than planned to boot!

Last time, we discussed the basics of lexing - turning a string into a list of tokens - and the ins and outs of how that's done. So now, we'll move onto parsing, which is the next step in the process of turning our original source code into something the computer can understand.
The rationale behind parsing is very simple, and mirrors the rationale behind lexing: it's a lot easier to reason about generating code if you're working at the statement/expression level instead of at the token level. Think about it. If you were trying to write a program from a description of what that program is supposed to do, wouldn't you rather the description say "put an if statement here" instead of "stick a semicolon here and a number there"?
Additionally - and this is probably at least two future blog posts away - it's a lot easier to perform optimizations when thinking at the statement level as well. If you have a general idea of what the original code is attempting to do, you can potentially replace it with something faster. We'll need other data structures besides what our parsing process implements, but we'll cross that bridge when we get to it.
So, how is Schego's parser built? At a high level, much the same way as any parser: We represent the program's structure with a tree, known as an AST (Abstract Syntax Tree), with each individual node adhering to this interface:

// base interface for functions needing to accept any kind of AST node
type AstNode interface {
    GetSubNodes() []AstNode
    AddSubNode(AstNode)
    GetType() AstNodeType
    DebugString() string
}

If you recall the first article, I didn't create any interfaces there, so why the change? There will be functions - as we'll see in a moment - that should ideally work with or return any potential combination of AST nodes; when you're parsing a language where literally everything evaluates to some kind of expression, you need that flexibility without repeating yourself. Since Go doesn't have classes or inheritance, let alone partially abstract classes, I had to split the definition and the shared implementation into two separate places: the definition lives in AstNode, and the baseline implementation in another struct, SExp - short for S-expression, the structure that LISP/Scheme syntax is built on:

// base struct that all AST node implementations build off of
type SExp struct {
    subNodes []AstNode
}

func (s *SExp) GetSubNodes() []AstNode {
    return s.subNodes
}
func (s *SExp) AddSubNode(node AstNode) {
    s.subNodes = append(s.subNodes, node)
}

Still pretty straightforward, right? Specific nodes implement GetType() and DebugString() to return their proper values, and the basic stuff that should only be written once gets written once. By the way, GetType() returns a value from an enum describing what sort of node the struct is. This isn't something we'll normally need, but for certain optimizations - including the tail-call optimization that all Scheme implementations are required to perform - it'll be handy to check whether a node's sub-nodes include nodes of a certain type.
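
I haven't reproduced the AstNodeType enum itself; in Go, an enum like this is typically an iota-based constant block, so picture something along these lines (a sketch - Schego's actual list of node types may differ):

type AstNodeType int

const (
    ProgramNode AstNodeType = iota
    IfNode
    IntNode
    FloatNode
    StringNode
    BoolNode
    // ...and so on for the other node types
)
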
For the actual implementation of one of the nodes, here's the current implementation of if expressions:

type IfExp struct {
    SExp
}

func NewIfExp(cond AstNode, onTrue AstNode, onFalse AstNode) *IfExp {
    node := new(IfExp)
    node.AddSubNode(cond)
    node.AddSubNode(onTrue)
    node.AddSubNode(onFalse)
    return node
}
func (i IfExp) GetType() AstNodeType {
    return IfNode
}
func (i IfExp) DebugString() string {
    return "IfExp(" + i.subNodes[0].DebugString() + ", " + i.subNodes[1].DebugString() + ", " + i.subNodes[2].DebugString() + ")"
}

Nothing too out of the ordinary here. But what about literals?

type StringLiteral struct {
    SExp
    Value string
}

func NewStringLiteral(value string) *StringLiteral {
    node := new(StringLiteral)
    node.Value = value
    return node
}
func (s StringLiteral) GetType() AstNodeType {
    return StringNode
}
func (s StringLiteral) DebugString() string {
    return "\"" + s.Value + "\""
}

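Since we now have a couple of node types to play with, here's a quick throwaway example (not from Schego's codebase, just for illustration) showing how the pieces compose:

exp := NewIfExp(NewStringLiteral("cond"), NewStringLiteral("yes"), NewStringLiteral("no"))
// prints: IfExp("cond", "yes", "no")
fmt.Println(exp.DebugString())
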
You might think that there's really no need for a string literal - or any other sort of literal, really - to be able to contain sub-nodes, an ability brought in by the SExp struct. And you'd be right; this is something of a leaky abstraction we've got going on here. For now, I think it's fine. Any code that manipulates the AST after it's built should be calling GetType() in the first place, and due to the way parsing works in Schego, nothing modifies a literal during parsing except the exact same code that creates it from the original tokens. I'll also admit that I don't know for sure whether what I describe here is an idiomatic Go solution to the problem, but it seemed fairly intuitive to me. If there's something more elegant out there, do let me know - I'm all in favor of doing no more work than necessary!
On a side note, structuring our data like this makes our tests fairly easy to write. All we have to do is compare the AST our parser produces against one we've constructed ourselves in the test, like this:

func TestParseSingleExp(t *testing.T) {
    tokens := LexExp("(+ 5 3)")
    program := ParseTokens(tokens)
    expectedProgram := NewProgram(NewAddExp(NewIntLiteral(5), NewIntLiteral(3)))
    checkProgram(program, expectedProgram, t)
}
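
I haven't shown checkProgram here; since DebugString() renders a node's entire subtree, a minimal version only needs to walk the top-level nodes and compare their debug strings - something like this sketch (the real test helper may differ):

func checkProgram(program *Program, expected *Program, t *testing.T) {
    programNodes := program.GetSubNodes()
    expectedNodes := expected.GetSubNodes()
    if len(programNodes) != len(expectedNodes) {
        t.Fatalf("expected %d top-level nodes, got %d", len(expectedNodes), len(programNodes))
    }
    for index, node := range programNodes {
        // DebugString() prints a node's whole subtree, so comparing
        // the strings compares the trees
        if node.DebugString() != expectedNodes[index].DebugString() {
            t.Errorf("expected %s, got %s", expectedNodes[index].DebugString(), node.DebugString())
        }
    }
}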

But this article was about parsing, right? How do we actually do the parsing bit? We've covered the data structures behind the process, but not the parsing algorithm itself. Schego's parser is what's known as a "recursive descent" parser; essentially, we define our "grammar" - our set of rules that describes our language - recursively. Take the if expression, for instance. We might say it's defined by the word "if," followed by a conditional expression, followed by two more expressions, one to be evaluated if the conditional comes out true and one if it comes out false.
What do we define as an expression? It could be a lot of different things - a literal, a math formula, a function, another if expression (!), lots of stuff. What's important is that when we're constructing an if expression, we ask the parser to run its general expression-parsing function for each of the if's three expressions. As long as we handle the base cases (literals/identifiers) properly, we can throw all sorts of wonky expressions at our if-parsing routine and it'll handle them just fine thanks to the recursion.
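
If it helps, here's the grammar we've described so far in informal notation (just an illustration for this post - there's no formal grammar file in Schego):

expression := literal
            | "(" operator expression* ")"
            | "(" "if" expression expression expression ")"
literal    := integer | float | string | boolean
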
Additionally, the resultant code is fairly reusable inside the parser itself. If we define a function to parse, say, a list literal, we can simply call that function any time we expect a list literal, and it handles all the nitty-gritty details of validation and list-literal-specific parsing for us. Pretty much a win-win no matter how you look at it. But enough justification. What does Schego's parsing code look like?

// ParseTokens takes tokens and returns an AST (Abstract Syntax Tree) representation
func ParseTokens(tokens []*Token) *Program {
    program := NewProgram()
    currentIndex := 0
    for currentIndex < len(tokens) {
        node, _ := parseExpression(tokens, &currentIndex)
        program.AddSubNode(node)
    }
    return program
}

// accept checks to see if the current token matches a given token type, and advances if so
func accept(tokens []*Token, expectedType TokenType, currentIndex *int) bool {
    // don't run past the end of the token stream
    if *currentIndex > len(tokens)-1 {
        return false
    }
    if tokens[*currentIndex].Type == expectedType {
        *currentIndex++
        return true
    }
    return false
}

// grabAccepted returns the token just before current, useful for grabbing the value of an accepted token
func grabAccepted(tokens []*Token, currentIndex *int) *Token {
    return tokens[*currentIndex-1]
}

// expect returns an error if the current token doesn't match the given type
func expect(tokens []*Token, expectedType TokenType, currentIndex *int) error {
    if len(tokens)-1 < *currentIndex {
        return errors.New("Unexpected EOF")
    } else if tokens[*currentIndex].Type != expectedType {
        return errors.New("Unexpected token")
    }
    return nil
}

func parseExpression(tokens []*Token, currentIndex *int) (AstNode, error) {
    // try literals first
    if accept(tokens, TokenIntLiteral, currentIndex) {
        literal := grabAccepted(tokens, currentIndex)
        return NewIntLiteral(bufferToInt(literal.Value)), nil
    } else if accept(tokens, TokenFloatLiteral, currentIndex) {
        literal := grabAccepted(tokens, currentIndex)
        return NewFloatLiteral(bufferToFloat(literal.Value)), nil
    } else if accept(tokens, TokenStringLiteral, currentIndex) {
        literal := grabAccepted(tokens, currentIndex)
        return NewStringLiteral(literal.Value.String()), nil
    } else if accept(tokens, TokenBoolLiteral, currentIndex) {
        literal := grabAccepted(tokens, currentIndex)
        return NewBoolLiteral(literal.Value.Bytes()[0] == 1), nil
    }
    // not a literal, attempt to parse an expression
    lparenError := expect(tokens, TokenLParen, currentIndex)
    if lparenError != nil {
        return nil, lparenError
    }
    // expect doesn't advance the index, so jump past the lparen ourselves
    *currentIndex++
    if accept(tokens, TokenOp, currentIndex) {
        // operator-parsing code here
    }
    if accept(tokens, TokenIdent, currentIndex) {
        identToken := grabAccepted(tokens, currentIndex)
        switch identToken.Value.String() {
        case "if":
            // TODO: error-handling here (and throughout the parser!)
            cond, _ := parseExpression(tokens, currentIndex)
            ifTrue, _ := parseExpression(tokens, currentIndex)
            ifFalse, _ := parseExpression(tokens, currentIndex)
            ifNode := NewIfExp(cond, ifTrue, ifFalse)
            expError := closeExp(tokens, currentIndex)
            if expError != nil {
                return nil, expError
            }
            return ifNode, nil
        }
    }
    // no matches?
    return nil, errors.New("Unexpected token")
}

// convenience function to ensure an expression is properly closed
func closeExp(tokens []*Token, currentIndex *int) error {
    rparenError := expect(tokens, TokenRParen, currentIndex)
    if rparenError != nil {
        return rparenError
    }
    *currentIndex++
    return nil
}

func bufferToInt(buffer bytes.Buffer) int64 {
    num, _ := binary.Varint(buffer.Bytes())
    return num
}
func bufferToFloat(buffer bytes.Buffer) float64 {
    bits := binary.LittleEndian.Uint64(buffer.Bytes())
    return math.Float64frombits(bits)
}
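
Those two helpers undo the byte-level encoding the lexer applied to literal values back in part 1 - varints for integers, little-endian IEEE 754 bits for floats. If you want to convince yourself of how that works, here's a quick round-trip for the integer case (illustrative only, not code from Schego):

var buf bytes.Buffer
tmp := make([]byte, binary.MaxVarintLen64)
n := binary.PutVarint(tmp, 42)
buf.Write(tmp[:n])
// prints 42
fmt.Println(bufferToInt(buf))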

I've gutted some case-specific stuff like the details of parsing operators in an attempt to get the point across as succinctly as possible, but the overarching algorithm is still there. You can see the recursive calls to parseExpression present in the if-parsing code, and the literal-parsing code that forms the base cases. This basic setup is enough to get this tricky-looking test to pass:

func TestIfExp(t *testing.T) {
    tokens := LexExp("(if (> 6 5) \"true\" \"false\")")
    program := ParseTokens(tokens)
    expectedProgram := NewProgram(NewIfExp(NewGtExp(NewIntLiteral(6), NewIntLiteral(5)), NewStringLiteral("true"), NewStringLiteral("false")))
    checkProgram(program, expectedProgram, t)
}

All the if-parsing code knows is that an if node takes three expressions. It doesn't know anything about what constitutes a valid expression, or how to parse one; it leaves that up to the parseExpression() function, and it all Just Works.
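
Incidentally, that test also exercises the operator branch I gutted from the listing above. That branch is built along exactly the same lines as the if code - roughly like this simplified sketch, which assumes binary operators (a sketch, not Schego's actual implementation):

if accept(tokens, TokenOp, currentIndex) {
    opToken := grabAccepted(tokens, currentIndex)
    // recursively parse both operands, just like the if branch does
    lhs, _ := parseExpression(tokens, currentIndex)
    rhs, _ := parseExpression(tokens, currentIndex)
    var opNode AstNode
    switch opToken.Value.String() {
    case "+":
        opNode = NewAddExp(lhs, rhs)
    case ">":
        opNode = NewGtExp(lhs, rhs)
    }
    if expError := closeExp(tokens, currentIndex); expError != nil {
        return nil, expError
    }
    return opNode, nil
}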

As always, the code is up on GitHub if you want to take a look. We'll be looking at stack-based virtual machines next time, so we can actually do something with the AST we've generated. As I alluded to earlier, at some point in the (near? distant?) future we'll also take a look at optimization, which is something I haven't really seen covered in any similar how-to-make-a-language series. So that'll be something to look forward to.