Writing a Recursive Descent Parser

Writing a Recursive Descent Parser

·

6 min read

Writing a Recursive Descent Parser

I have written an article about parser in the past. In that article I implemented a json parser using the parser combinator technique.

In this article, I will use leetcode problem Basic Calculator II as an example to show you how to write a parser with the idea of recursive descent.

As we all know, for a normal parser, we first need to implement a scanner to convert source code characters into tokens. For a problem as basic calculator, we only have four operators and number token. Let define the token type and token class first.

type TokenType = "NUMBER" | "PLUS" | "MINUS" | "MULTIPLY" | "DIVIDE";

class Token {
    public type: TokenType;
    public lexeme: string;

    constructor(type: TokenType, lexeme: string) {
        this.type = type;
        this.lexeme = lexeme;
    }
}

Then we need to implement a scanner, which takes in the source code as the form of characters and output a token array. The logic of scanner is pretty simple, we consume character by character, and create tokens along the way.

class Scanner {
    private source = "";
    private start = 0;
    private current = 0;
    private tokens: Token[] = [];

    constructor(source: string) {
        this.source = source;
    }

    public scanTokens(): Token[] {
        while (!this.isAtEnd()) {
            this.start = this.current;
            this.scanToken();
        }
        return this.tokens;
    }

    private scanToken(): void {
        const char = this.advance();
        switch (char) {
            case "+":
                this.tokens.push(new Token("PLUS", char));
                break;
            case "-":
                this.tokens.push(new Token("MINUS", char));
                break;
            case "*":
                this.tokens.push(new Token("MULTIPLY", char));
                break;
            case "/":
                this.tokens.push(new Token("DIVIDE", char));
                break;
            case " ":
                break;
            default:
                if (/\d/.test(char)) {
                    while (!this.isAtEnd() && /\d/.test(this.peek())) {
                        this.advance();
                    }
                    this.tokens.push(
                        new Token(
                            "NUMBER",
                            this.source.slice(this.start, this.current)
                        )
                    );
                } else {
                    throw new Error(`invalid char: ${char}`);
                }
        }
    }

    private advance() {
        return this.source.charAt(this.current++);
    }

    private peek() {
        return this.source.charAt(this.current);
    }

    private isAtEnd() {
        return this.current >= this.source.length;
    }
}

Now we have this simple scanner, let make a simple test.

const s = "1+22*333-4444/55555";
const scanner = new Scanner(s);
console.log(scanner.scanTokens())

// [
//     Token { type: "NUMBER", lexeme: "1" },
//     Token { type: "PLUS", lexeme: "+" },
//     Token { type: "NUMBER", lexeme: "22" },
//     Token { type: "MULTIPLY", lexeme: "*" },
//     Token { type: "NUMBER", lexeme: "333" },
//     Token { type: "MINUS", lexeme: "-" },
//     Token { type: "NUMBER", lexeme: "4444" },
//     Token { type: "DIVIDE", lexeme: "/" },
//     Token { type: "NUMBER", lexeme: "55555" }
//   ]

As you can see, we have a valid token array as expected.

The next step is to write the parser whick takes in these tokens and output a tree structure called abstract syntax tree(ast). Why do we need such a tree? Well, take this basic calculator as an example, multiplication and division have higher precedence than addition and subtraction. So to calculate the result, we need to evaluate the multiplication and division, then the addition and subtraction.

To do this, we use the approach called recursive descent. With this approach, we first need to write the rules for different expressions, then we write code according to the rules. The rules is written as a recursive way from the top(lowest precedence) to the bottom(highest precedence), so this approach is called recursive descent.

So for basic calculator, we actually only have 3 rules. The top one is the addition and the subtraction. The second one is the multiplication and division because of its higher precedence. The last one is the literal value. This is the terminal value, it has the highest precedence and we will not go deeper.

l3 => l2 ( ( + | - ) l2 )*
l2 => l1 ( ( * | / ) l1 )*
l1 => number

For this tree structure, we need to first define the the node structure of the tree. In basic calculator, we only have 2 type of nodes: Literal and Binary operation. Let's define them.


class Expr {
    toString(): string {
        throw new Error("not implemented");
    }
}

class LiteralExpr extends Expr {
    private value: number;
    constructor(value: number) {
        super();
        this.value = value;
    }
    toString(): string {
        return `${this.value}`;
    }
}

class BinaryExpr extends Expr {
    private leftExpr: Expr;
    private operator: Token;
    private rightExpr: Expr;
    constructor(leftExpr: Expr, operator: Token, rightExpr: Expr) {
        super();
        this.leftExpr = leftExpr;
        this.operator = operator;
        this.rightExpr = rightExpr;
    }
    toString(): string {
        return `( ${this.operator.lexeme} ${this.leftExpr} ${this.rightExpr} )`;
    }
}

As you can see in the toString method, we print the binary expression as the form of polish notation so we can see this precedence more clearly.

Then let's implement our parser. As we just said, the parser logic strictly follows the rules we defined above. We just need to translate them into code.

class Parser {
    private tokens: Token[];
    private current = 0;

    constructor(tokens: Token[]) {
        this.tokens = tokens;
    }

    // parse method, call l3(), then call, l2(), then call l1()
    parse(): Expr {
        return this.l3();
    }

    l3(): Expr {
        let expr = this.l2();
        while (!this.isAtEnd() && this.match("PLUS", "MINUS")) {
            const operator = this.advance();
            const rightExpr = this.l2();
            expr = new BinaryExpr(expr, operator, rightExpr);
        }
        return expr;
    }

    l2(): Expr {
        let expr = this.l1();
        while (!this.isAtEnd() && this.match("MULTIPLY", "DIVIDE")) {
            const operator = this.advance();
            const rightExpr = this.l1();
            expr = new BinaryExpr(expr, operator, rightExpr);
        }
        return expr;
    }

    l1(): Expr {
        if (this.match("NUMBER")) {
            const token = this.advance();
            return new LiteralExpr(parseInt(token.lexeme));
        }
        throw new Error("invalid expression");
    }

    private advance() {
        return this.tokens[this.current++];
    }

    private peek(): Token {
        return this.tokens[this.current];
    }

    private match(...types: TokenType[]) {
        for (const type of types) {
            if (this.peek().type === type) {
                return true;
            }
        }
        return false;
    }

    private isAtEnd() {
        return this.current >= this.tokens.length;
    }
}

OK, let make a simple test.

const s = "1+22*333-4444/55555";
const scanner = new Scanner(s);
const tokens = scanner.scanTokens();

const parser = new Parser(tokens);
const ast = parser.parse();

console.log(ast.toString())
// ( - ( + 1 ( * 22 333 ) ) ( / 4444 55555 ) )

Well, it seems pretty good.

Now we have this tree structure, the last step is to evaluate the tree the calculate the values. Because this tree structure already express the precedence, we can just use a post order traversal to evaluate values along the way. A more simple solution is to implement the evaluation logic inside each node then evaluate from the root to evaluate the whole tree recursively.


class Expr {
    toString(): string {
        throw new Error("not implemented");
    }
    toValue(): number {
        throw new Error("not implemented");
    }
}

class LiteralExpr extends Expr {
    private value: number;
    constructor(value: number) {
        super();
        this.value = value;
    }

    toString(): string {
        return `${this.value}`;
    }

    toValue(): number {
        return this.value;
    }
}

class BinaryExpr extends Expr {
    private leftExpr: Expr;
    private operator: Token;
    private rightExpr: Expr;

    constructor(leftExpr: Expr, operator: Token, rightExpr: Expr) {
        super();
        this.leftExpr = leftExpr;
        this.operator = operator;
        this.rightExpr = rightExpr;
    }

    toString(): string {
        return `( ${this.operator.lexeme} ${this.leftExpr} ${this.rightExpr} )`;
    }

    toValue(): number {
        switch (this.operator.type) {
            case "PLUS":
                return this.leftExpr.toValue() + this.rightExpr.toValue();
            case "MINUS":
                return this.leftExpr.toValue() - this.rightExpr.toValue();
            case "MULTIPLY":
                return this.leftExpr.toValue() * this.rightExpr.toValue();
            case "DIVIDE":
                return Math.floor(this.leftExpr.toValue() / this.rightExpr.toValue());
            default:
                throw new Error("invalid operator " + this.operator.lexeme);
        }
    }
}

As you can see, we add the toValue method for each node. Now let's make the final test by evaluating from the root node.

const s = "1+22*333-4444/55555";
const scanner = new Scanner(s);
const tokens = scanner.scanTokens()

const parser = new Parser(tokens);
const ast = parser.parse();

console.log(ast.toValue());
// 7327