Step 2: Parsing and Printing | Language Workshop

Complexity: Long
Previous Step 1: The Beginning
Next Step 3: Evaluation

This is quite a long chapter, with a lot of explanatory text. Feel free to jump straight to the task, and if you get stuck, you can always come back to read any relevant parts.

Motivation

In the last step, we wrote a program that could read a line of input as a string, and print that string out to the console. While it’s a good starting point, it doesn’t do anything particularly interesting.

Our first real step towards a language interpreter is a parser. A parser converts an input string into an abstract syntax tree (or AST): a data structure that represents the internal structure of expressions in our language.

Once we have an AST, we’ll be able to reason about our code in a structured way. In particular, type checking becomes a form of tree traversal, and evaluation becomes a form of tree reduction.

Grammars and BNF

All languages, even human languages, adhere to a grammar. A grammar describes the syntax, or the textual representation, of a language. If a sentence or expression in the language satisfies the rules as laid out by the grammar, we call it valid.

English grammar states that a sentence which follows a subject-verb-object structure, such as “The sky (subject) is (verb) green (object)”, is valid. But, for example, it’s a grammatical error to have a verb follow another verb, such as “eat organise.”

Each word in the grammar refers to a specific type of concept; nouns represent things, verbs represent actions, and so on.

Programming languages also adhere to a grammar. Here, a grammar says how we’re allowed to combine expressions to build larger expressions. For example, using Python’s syntax, 1 + 2 is composed of the literal expression 1, the operator +, and the literal expression 2. However, 1 + would not be valid in Python’s syntax, as the + operator must have an expression on either side.

To describe grammars, we’ll use a notation called Backus-Naur Form (or BNF).

BNF is a formal notation used to describe the syntax of programming languages and other formal languages. It provides a clear and concise way to specify the grammatical structure of a language by defining its syntax rules in terms of production rules. Each rule consists of a left-hand side, which is a non-terminal symbol representing a category of expressions, and a right-hand side, which describes how that non-terminal can be expanded into a sequence of terminal symbols (actual characters or tokens) and/or other non-terminals.

In BNF, non-terminal symbols are typically enclosed in angle brackets (e.g., <expression>, <term>), while terminal symbols are written as plain text. The production rules are defined using the ::= operator, which indicates that the left-hand side can be replaced by the right-hand side. Additionally, we may use regular expressions for clarity, if a production rule would otherwise be very large. We’ll denote a regular expression by surrounding it in forward slashes, like /[0-9]+/.

Sometimes, a rule may allow different sub-forms. For example, in English, wherever we use a noun, we may also use a noun-phrase (a noun paired with another grammatical construct), such as The cat (an article paired with a noun) or large egg (an adjective paired with a noun). The | symbol introduces multiple valid forms for a rule, each of which is called an alternative. For this workshop, if a rule has multiple alternatives, we’ll try them from left to right (although this isn’t universally the case!).

For example, a simple BNF rule for an arithmetic expression might look like this:

<expression> ::= <term> | <expression> + <term>
<term>       ::= <number> | <term> * <number>
<number>     ::= /[0-9]+/

Here, we have three non-terminal symbols: <expression>, <term>, and <number>. Our terminal symbols are +, *, and non-empty sequences of digits from 0 to 9 (denoted by the regular expression [0-9]+).
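To see the production rules in action, here’s how this grammar derives the string 1 + 2, expanding one rule at a time:

```
<expression>
  → <expression> + <term>     (second alternative of <expression>)
  → <term> + <term>           (first alternative of <expression>)
  → <number> + <term>         (first alternative of <term>)
  → <number> + <number>       (first alternative of <term>)
  → 1 + 2                     (both <number>s match /[0-9]+/)
```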

Syntax Trees

Grammars provide syntactic rules for a language; syntax trees provide syntactic structure for a language.

In other words, the act of parsing involves taking an input string, and, according to the rules given by the grammar for your language, trying to produce a syntax tree that represents the input expression.

To illustrate this concept, let’s consider the simple sentence “I love TypeSig” in natural language. The syntax tree for this sentence would break it down into its grammatical components. At the top level, we have a sentence (S), which can be further divided into a noun phrase (NP) and a verb phrase (VP). The noun phrase consists of the subject “I,” while the verb phrase contains the verb “love” and the object “TypeSig.” The resulting syntax tree would look like this:

  S
 / \
NP  VP
|   / \
I  V   NP
   |   |
  love TypeSig

Now, let’s consider a Python-style expression, such as result = 3 + 5. The syntax tree for this expression would reflect its components according to the rules of the grammar for arithmetic expressions. At the top level, we have an assignment statement (Assign), which consists of a variable (result) and an expression. The expression itself is a binary operation (Add) that combines two integer literals (3 and 5). The corresponding syntax tree would look like this:

      Assign
     /      \
  Var       BinaryOp (Add)
  |         /   \
result      Lit  Lit
            |    |
            3    5

Ambiguity

Natural languages tend to be ambiguous, meaning that a single sentence can sometimes be parsed into more than one parse tree, giving it multiple interpretations. For example, the phrase “superfluous hair remover” can either refer to a product that removes superfluous hair, or to hair remover that is superfluous.

          Phrase
         /      \
      Adj        NP
     /          /  \
superfluous  hair remover

          Phrase
         /      \
     AdjP        N
    /     \        \
superfluous hair    remover

This ambiguity can enrich communication in human languages, allowing for puns and poetic expressions.

For programming languages, however, clarity is paramount. When we’re talking to a computer, we want to be as exact about what we mean as possible! As a result, when choosing or designing a grammar for a programming language, we should always make sure that it’s unambiguous. For MLTS, we’ll be using a simple grammar that’s known to be unambiguous.

S-Expressions

To keep things simple for our language, we’ve opted for a very simple grammar, known as S-Expressions. S-Expressions are known for their use in the Lisp family of programming languages, where they are sometimes called sexps. We chose S-Expressions in particular as our grammar, since they give us a direct correspondence between the textual representation of our language and our internal representation. This direct correspondence makes implementing parsing easier compared to a more traditional grammar, while still allowing us to write complex expressions.

An S-Expression consists of a pair of brackets that contains a list of “atoms” separated by whitespace. These atoms may be literal values, like numbers or strings, or symbols, which are arbitrary strings of characters. You’re also allowed to have another S-Expression in place of an atom, allowing you to build more complex programs.

In Lisp-like syntax, the first atom is used as a function or operator name, and the following atoms are provided as arguments to the function or operator. Here’s an example of some code written in this style:

(+ 1 2)
(* 3 (- 4 5))
(concat3 "TypeSig" "<3" "you")
(lambda (x) x)
(define factorial (n) (if (= 0 n) 1 (* n (factorial (- n 1)))))

You don’t need to fully understand what the semantics of these mean yet! We’ll build up to that knowledge over the next few steps. The syntax may look quite strange if you’re used to popular languages like Python or JavaScript, but it’s still expressive enough to represent any program.

We’ve picked S-Expressions over other grammars (say, C-style or Python-like grammars) because it’s very straightforward to convert them into an AST. In fact, they’re already a textual representation of an AST! To see what we mean by this, consider the following snippet:

(+ 1 (* (- 2 3) 4))

This code corresponds to the following AST:

  sexp
 / | \
+  1  sexp
    /  |   \
   *  sexp  4
     / | \
    -  2  3

Each of the bracket pairs turns into a sexp node, and the atoms between the brackets turn into its children.

The Grammar for S-Expressions

In order to build our parser, we’ll need a concrete grammar to work from. We’ll build one up piece by piece. First, let’s just write a grammar that recognises some basic literals.

<integer> ::= /[0-9]+/          // matches one or more digits
<symbol>  ::= /[^\s()]+/        // one or more non-whitespace, non-bracket characters
<literal> ::= <integer> | <symbol>

Here, we’ve defined two literal types: <integer>s (whole numbers), and <symbol>s (any string of characters that doesn’t contain whitespace or brackets). We’ve given regular expressions to describe valid strings that can be parsed as these literals, although you don’t necessarily need to use regex in your parser implementation.

We’ve then combined these literal types into one grammar rule, called <literal>. A <literal> can be either an <integer>, or a <symbol>. The interpretation of the | character depends on what type of grammar you’re defining: in some grammars, the pipe denotes a non-deterministic choice between either branch; in others, you’d deterministically try the left branch first, then the right. We’ll choose the deterministic type; it’s less general, but it’s easier to work with both on paper and when implementing the code for your parser.

To show how the grammar works, we’ll walk through an example. Let’s say we’re trying to parse the expression hello using this grammar, where our starting symbol is <literal>.

  • A <literal> can be either an <integer> or a <symbol>. Since we have a deterministic grammar, we’ll try the <integer> branch first.
  • An <integer> must consist only of digits from 0-9, and there must be at least one. Our input is hello, which contains non-digit characters (in fact, it doesn’t contain any digits at all!), so we can’t apply this rule. We must backtrack.
  • Backtracking to <literal>, the next branch to try is <symbol>.
  • A <symbol> is at least one character that isn’t brackets or whitespace. Our input is hello, which matches this rule. We consume the first character in our input, h.
  • Because <symbol> accepts multiple characters, we keep consuming characters from our input until we either find a character that <symbol> doesn’t accept, or we run out of characters in our input. In either case, we then return to <literal>. For our example, we consume the entire input hello, and return to <literal>.
  • <literal> (our starting rule) has nothing left to do, since it only consumes one literal at a time and we’ve consumed all of our input, so we’re done. Our final parse trace is <literal> -> <symbol>.

Currently, our grammar admits the following strings:

1
h3ll0
42
2+2=4
typesig-is_best-sig!
137.5

It won’t admit the following:

hello world!
(123)
1 + 2
(* 3 4)
1+(2*3)

Let’s update our parse rules to allow multiple expressions. Luckily for us, extending the grammar is straightforward - we just add a new rule to the grammar. We’ll add a new rule, called <program>, which will be our starting rule from now on:

<integer> ::= /[0-9]+/          // matches one or more digits
<symbol>  ::= /[^\s()]+/        // one or more non-whitespace, non-bracket characters
<literal> ::= <integer> | <symbol>
<program> ::= <literal>*

The new <program> rule refers to <literal>, but there’s an * after it. This symbol is called a Kleene star, and it means that we’re looking for zero or more <literal>s. You might have seen similar syntax before if you’ve studied regular expressions. We could replace it with a + if we wanted one or more <literal>s, or a ? if we wanted zero or one.

By the way, we’ve not handled the fact that there needs to be whitespace between each literal. We could replace our program rule with something like <program> ::= <literal> | <literal> \s <program>, which would do the trick, but for the sake of explanation and clarity, we’ll leave it implicit. We’re going to use a sneaky trick later on to avoid having to worry about it anyway!

Now we admit the bracket-free strings from the list above, like hello world! and 1 + 2. We’ve implemented a bunch of literals, but we haven’t touched S-Expressions yet! Let’s move on to them now. As mentioned above, an S-Expression is a pair of brackets, containing a list of expressions. We’ll say an expression is either an atom (a literal) or an S-Expression (with the brackets). In other (more formal) words:

<expr>    ::= <literal> | ( <expr>* )

The ( and ) are terminal symbols: they mean we’re checking our input for a literal ( or ) character respectively.

We’re now ready to construct our full grammar:

<integer> ::= /[0-9]+/          // matches one or more digits
<symbol>  ::= /[^\s()]+/        // one or more non-whitespace, non-bracket characters
<literal> ::= <integer> | <symbol>
<expr>    ::= <literal> | ( <expr>* )
<program> ::= <expr>*

All we’ve done is combine the S-Expression grammar with the earlier one, and made <program> admit a list of S-Expressions instead of literals.

Exercise 3

Which of the following are valid S-Expressions?

  • (factorial 1)

    This is a valid S-Expression.

  • factorial 2

    This is not a valid S-Expression. If you want to join multiple atoms into one expression, they must be in brackets.

  • (factorial)

    This is a valid S-Expression syntactically. It may not be a semantically valid program, but that doesn’t affect its syntactic validity.

  • fact

    This is a valid S-Expression, consisting of the single atom fact.

  • !

    This is a valid S-Expression, consisting of the single atom !.

Lexing

It’s perfectly possible to build a parser that operates directly on an input string, and that’s the model we’ve operated under so far. But, the task of parsing is usually simpler if we first convert our input string into a list of tokens — strings that are syntactically important to our grammar — and run our parser on that instead of on a list of characters. This way, our parser doesn’t need to work out if the character it’s looking at is part of a symbol or a literal, as that job has already been done for it. This process is called lexing, and a program that does lexing is called a lexer. Luckily for us, lexing is straightforward for most grammars.

For example, consider the simple Python-like expression: result = (1 * 2) + 3. During the lexing process, this expression would be broken down into the following tokens:

  • result (an identifier)
  • = (an assignment operator)
  • ( (an open bracket)
  • 1 (an integer literal)
  • * (a multiplication operator)
  • 2 (an integer literal)
  • ) (a close bracket)
  • + (an addition operator)
  • 3 (an integer literal)

The lexer identifies each of these components based on predefined rules, such as recognizing that anything starting with a letter is an identifier, while sequences of digits represent integer literals. These rules are derived from the grammar of the language; specifically, every terminal symbol in the grammar becomes a token type. Additionally, the lexer ignores any whitespace that may appear between tokens.

The token types for our S-Expression syntax will be symbols, integer literals, open bracket, and close bracket. For S-Expressions in particular, there’s a neat hack which performs most of the work of a lexer for us!

Remember how most things in our syntax were whitespace separated? The only real exception to this is the brackets delimiting an S-Expression. Here’s the hack: we can just loop through our input, and wherever we find a bracket, we can surround it with whitespace. For example: (+ 1 (exp 2 3)) becomes ( + 1 ( exp 2 3 ) ). Now all we have to do is split our string on whitespace. Your language probably already has support for this in its standard library: for most languages, it’s something along the lines of String.split(); for Haskell, there’s a function called words.

We now have a list of strings; our example from before has become ["(", "+", "1", "(", "exp", "2", "3", ")", ")"]. It’s now quite easy to check which token type each element in this list is: if the string is entirely digits, it’s an integer token; if it’s an open or close bracket, then it’s an open or close bracket token; and otherwise it’s a symbol (since whitespace has already been handled).
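As a concrete sketch of this trick in Python (the name tokenise is our own choice; any equivalent pad-and-split approach works):

```python
def tokenise(source: str) -> list[str]:
    """Pad every bracket with spaces, then split on whitespace.

    str.split() with no argument collapses runs of whitespace, so the
    extra padding never produces empty tokens.
    """
    padded = source.replace("(", " ( ").replace(")", " ) ")
    return padded.split()
```

For example, tokenise("(+ 1 (exp 2 3))") gives ["(", "+", "1", "(", "exp", "2", "3", ")", ")"].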

As you’ll see later, our parsing logic will be much simpler thanks to our lexer.

Abstract Syntax Tree

As mentioned before, our parser should convert a string into an AST. But how do we represent an AST in code? We’ll make a custom data type for it. Depending on your language, the exact mechanism you use to do so differs; you might use an algebraic data type for something like Haskell, a class for something like Python or Java, or a struct for something like Rust or C. Our AST is going to have a similar structure to the one drawn above. Here’s how we might represent it in Haskell:

data Expr = LInt Integer
          | LSym String
          | SExpr [Expr]
          deriving (Eq, Show)
type Program = [Expr]

It’s a good idea to keep your AST representation as flat as possible. A simple AST makes writing your evaluator much more pleasant. The above example keeps the literals and the S-Expressions at the same level, compared to the grammar, which had a separate rule for literals. Here I’ve defined a type alias so I can call a list of Exprs a Program, just for convenience.

Parser

Now we need to actually write our parser. A parser, as mentioned before, is a function from a string to an AST. This type signature works in the abstract sense, but we’ll need to keep our list of tokens accessible. In OOP languages, you should probably make a class that has the token stream as a variable. For a pure functional language like Haskell, we’ll need to pass around our data explicitly. This means our type actually looks more like [String] -> (Expr, [String]) (remember that we’re representing our token stream as a list of strings).

We’ll use a top-down approach, which means we’ll start by writing a parser for a Program (a list of Exprs), and work our way down to the smaller types.

A program, as defined by our grammar, is a list of top-level expressions. Thus, our function to parse a program should repeatedly try and parse an expression from the token stream until there aren’t any tokens left. Once we’ve reached the end of the input, we should return the list of expressions we’ve built up.

Thanks to the work we did above with the lexer, each token in our stream can be one of three options: an open bracket (, a close bracket ), or some other string of characters (excluding whitespace and brackets). Not convinced? Go back to the lexer section, and see if you can find an input that would produce something else.

So how do we parse an expression?

As a reminder, our current grammar looks like this:

<integer> ::= /[0-9]+/          // matches one or more digits
<symbol>  ::= /[^\s()]+/        // one or more non-whitespace, non-bracket characters
<literal> ::= <integer> | <symbol>
<expr>    ::= <literal> | ( <expr>* )
<program> ::= <expr>*

The current token at the front of our token stream contains everything we need to know. If it’s a (, then we’re starting a new S-Expression, which we’ll parse separately. The S-Expression parser will handle the closing bracket, so if our token is ) then we’ve found a syntax error in the input. Now we know our current token must be a literal. If we can convert the token from a string into an integer, then we’ll say it’s an integer literal. Otherwise, it must be a symbol.

So far, so good; we can parse literals, and handle at least one form of syntax error. All that’s left is to parse S-Expressions.

The grammar rule for an S-Expression was ( <expr>* ). We already know how to parse (, ) and <expr>, so all that’s left is the *! We already dealt with a similar situation when parsing the whole program, except here instead of stopping at the end of the input, we want to stop when our current token is a ). If we do reach the end of the input before we find our ), then there must have been another syntax error!

With that, we should now be able to parse everything as dictated by our grammar! We just need to put everything together, by writing a function that calls the lexer on our input, passes the result to our program parser, and then returns the parsed AST.
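To make the above concrete, here’s a sketch of the whole parser in Python. The class and function names are illustrative (they mirror the Haskell AST from earlier); your own structure may differ, especially in another language.

```python
from dataclasses import dataclass

# An illustrative AST, mirroring the Haskell Expr type from earlier.
@dataclass
class LInt:
    value: int

@dataclass
class LSym:
    name: str

@dataclass
class SExpr:
    children: list

def parse_expr(tokens):
    """Parse one expression; return it along with the remaining tokens."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head == ")":
        raise SyntaxError("unexpected closing bracket")
    if head == "(":
        return parse_sexpr(rest)
    if head.isdigit():                  # matches /[0-9]+/
        return LInt(int(head)), rest
    return LSym(head), rest

def parse_sexpr(tokens):
    """Parse the body of an S-Expression; the '(' is already consumed."""
    children = []
    while tokens and tokens[0] != ")":
        expr, tokens = parse_expr(tokens)
        children.append(expr)
    if not tokens:
        raise SyntaxError("expected closing bracket")
    return SExpr(children), tokens[1:]  # drop the ")"

def parse_program(tokens):
    """Parse zero or more expressions until the stream runs out."""
    exprs = []
    while tokens:
        expr, tokens = parse_expr(tokens)
        exprs.append(expr)
    return exprs
```

Note how each function returns the tokens it didn’t consume; this is the [String] -> (Expr, [String]) shape described above, just written imperatively.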

Pretty Printing

We’ve spent most of this step implementing a way to go from strings to an AST. While we’re at it, we might as well write a function that goes the other way. Such functions are called pretty printers, and are thankfully much easier to implement than parsers!

We’ll make our program pretty printer put each top-level expression on its own line. Then, for each expression:

  • if it’s an integer literal, convert the integer into a string
  • if it’s a symbol, then do nothing (it’s already a string)
  • if it’s an S-Expression, then convert all of the expressions inside it to strings (recursion!), put whitespace between them, and surround them with brackets.

You should make this function return a string, rather than immediately printing it to the console.
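Those three cases translate almost line-for-line into code. Here’s a sketch in Python, using some illustrative dataclasses for the AST (adapt it to whatever representation you chose):

```python
from dataclasses import dataclass

# Illustrative AST classes; substitute your own representation.
@dataclass
class LInt:
    value: int

@dataclass
class LSym:
    name: str

@dataclass
class SExpr:
    children: list

def pretty(expr) -> str:
    """Convert a single expression back into its textual form."""
    if isinstance(expr, LInt):
        return str(expr.value)   # integer literal: stringify the number
    if isinstance(expr, LSym):
        return expr.name         # symbol: already a string
    # S-Expression: recurse on the children, join with spaces, add brackets.
    return "(" + " ".join(pretty(child) for child in expr.children) + ")"

def pretty_program(exprs) -> str:
    """Put each top-level expression on its own line."""
    return "\n".join(pretty(expr) for expr in exprs)
```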

3/4ths of a REPL

We’ve implemented a parser, and a pretty printer. It’s time to update our REPL function from step 1! We should pass the user’s input into the parser, feed the resultant AST into the pretty printer, and print the output to the console.

Task

Implement a data structure for an AST. If you’re using a functional-style language, you may want to use an algebraic data type to do this; if you’re in an OOP-style language, you may want to use classes instead.

Next, you should write the lexer, which converts an input string into a list of tokens.

Once the lexer is complete, write a parser that converts a list of tokens into the corresponding syntax tree, represented by your AST data structure. It should match the following grammar:

<integer> ::= /[0-9]+/          // matches one or more digits
<symbol>  ::= /[^\s()]+/        // one or more non-whitespace, non-bracket characters
<literal> ::= <integer> | <symbol>
<expr>    ::= <literal> | ( <expr>* )
<program> ::= <expr>*

Your language of choice likely has built-in functions to convert a string to an integer (Python has int, Haskell has read, Rust has string.parse::<i32>()). You’ll probably want to use one of these functions to determine if a literal is a symbol or an integer.
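For instance, in Python you could attempt the conversion and fall back to a symbol if it fails (a sketch; classify and the tag strings are our own names):

```python
def classify(token: str):
    """Decide whether a token is an integer literal or a symbol."""
    try:
        # Note: int() also accepts signed numbers like "-3", which our
        # grammar treats as symbols; tighten this check if you care.
        return ("int", int(token))
    except ValueError:
        return ("sym", token)
```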

Finally, you should also implement a pretty printer that converts the AST back into a string. Hook these both into the REPL from the previous step, by parsing the user’s input, and pretty printing the result.

Hints

The code for this section can be quite difficult to come up with by yourself. If you’ve spent some time trying to come up with a design for your parser, but are getting a little stuck, here’s some pseudocode that might guide you.

Lexer pseudocode
  function lex(input: String) -> List<String>
    input = input.replace("(", " ( ")
    input = input.replace(")", " ) ")

    -- split on runs of whitespace; splitting on a single " " would
    -- leave empty strings behind
    return input.splitOnWhitespace()
Parser pseudocode
  enum Expr
    LInt Integer,
    LSym String,
    SExpr (List<Expr>)

  -- Parses an S-Expression (assuming that the initial `(` has been consumed)
  function parseSExpr(tokens: List<String>) -> (Expr, List<String>)
    res: List<Expr> = []

    while (tokens is not [] and head(tokens) is not ")")
      (e, ts) = parseExpr(tokens)
      res.add(e)
      tokens = ts

    if (tokens is [])          -- ran out of input before finding `)`
      error("Expected closing bracket")

    return (SExpr(res), tail(tokens))  -- `tail` drops the closing bracket

  -- Parses a single expression
  function parseExpr(tokens: List<String>) -> (Expr, List<String>)
    match head(tokens)
      case ")" => error("Unexpected closing bracket")
      case "(" => return parseSExpr(tail(tokens))
      case t => if isNumeric(t)
        then return (LInt(int(t)), tail(tokens))
        else return (LSym(t), tail(tokens))

  -- Parses a program (zero or more expressions)
  function parseProgram(tokens: List<String>) -> List<Expr>
    res: List<Expr> = []

    while (tokens is not [])
      (e, ts) = parseExpr(tokens)
      res.add(e)
      tokens = ts

    return res

Tests

Here are some test cases that you can use to check if your implementation is along the right lines:

MLTS>
MLTS> 1
1
MLTS> hello
hello
MLTS> ( )
()
MLTS> (+ 1 2)
(+ 1 2)
MLTS> (+ 1 (* 2 3))
(+ 1 (* 2 3))
MLTS> (())
(())
MLTS> (  ( )    )
(())
MLTS> (   foo    bar  )
(foo bar)
MLTS> (* 3
parse: expected closing bracket
MLTS> * 3 4)
parse: unexpected closing bracket

Extra Challenges

These are some extra challenges you can attempt to build your understanding further, and make your interpreter more feature-complete. None of them are required for a fully-functional interpreter. They are listed in order of subjective difficulty; if you struggle on the later ones, you should move on to the next step and come back later. Depending on your language choice, they might be easier or harder than anticipated!

  • Allow floating point numbers as well as integers.

  • Add support for comments.

  • Treat square brackets ([/]) the same as normal brackets/parentheses, so the user can switch between them for clarity.

  • Add the REPL commands :lex and :parse, which take an expression and print the output of the lexer and parser respectively when run on that expression. For example:

    MLTS> :lex (+ 1 (* 2 foo))
    ["(", "+", "1", "(", "*", "2", "foo", ")", ")"]
    MLTS> :parse (+ 1 (* 2 foo))
    SExpr [LSym "+", LInt 1, SExpr [LSym "*", LInt 2, LSym "foo"]]
    

    Your output may look a bit different depending on how you implemented the data structure that represents parsed expressions.

    For bonus points, make :parse print its output in a tree format!

  • Extend the grammar, tokeniser and parser to support string literals (strings of characters surrounded by double quotes).

  • When a syntax error occurs, print out the line of input it occurred on, and highlight underneath the unexpected token. The input:

    (+ 1 2)
    (* 3 4))
    (- 5 6)
    

    should show an error similar to the following:

    input:2:8: Syntax Error: Unexpected closing bracket:
    2 | (* 3 4))
               ^
    

    For bonus points, add colour to the output!

  • Allow the pretty printer to print code in colour. For example, you may want to highlight integers in blue, string literals in green, brackets in grey (so they’re less visible/distracting), and a function head in bold.

  • The combination of a parser and a pretty printer acts as a code formatter. Add a command line flag that formats the contents of a supplied file, and rewrite the pretty printer to display S-Exprs vertically if they have more than e.g. 3 elements:

    (foo 1 2 3 4)
    (concat3 "hello, " (intToString 42) " world!")
    (* 5 6)
    

    becomes:

    (foo 1
         2
         3
         4)
    (concat3 "hello, "
             (intToString 42)
             " world!")
    (* 5 6)