Writing a JSON Parser

Writing a JSON Parser

·

10 min read

In the previous article, we talked about parser combinators. At the end of that article, we also tested our combinators skills by parsing ternary expression. In this article, let's go further to use parser combinators to write a JSON parser.

We will start with this char function just like before. It is a base parser, used to parse an arbitrary character.

const char = c => state => {
    if (state.error) return state;

    const { index, data } = state;
    if (index >= data.length) {
        return {
            ...state,
            error: `char(): not enough data to parse, index: ${index}, char: ${c}`
        };
    }

    const value = data.charAt(index);
    if (value !== c) {
        return {
            ...state,
            error: `char(): parse error, index: ${index}, expect: ${c}, got: ${value}`
        };
    }

    return {
        ...state,
        index: index + 1,
        value
    };
};

Other than this, since we don't want to write all related character literals, so we need another base parser, we call it nchar, meaning not specified character.

const nchar = c => state => {
    if (state.error) return state;

    const { index, data } = state;
    if (index >= data.length) {
        return {
            ...state,
            error: `nchar(): not enough data to parse, index: ${index}, char: ${c}`
        };
    }

    const value = data.charAt(index);
    if (value === c) {
        return {
            ...state,
            error: `nchar(): parse error, index: ${index}, expect not: ${c}, got: ${value}`
        };
    }

    return {
        ...state,
        index: index + 1,
        value
    };
};

Just as the name goes, this function takes in one character, and is able to parse any character other than this input character. Later when we parse JSON string, we will need this.

Now comes the parser combinators. Below is the and and or function, just like before.

const _or = (p1, p2) => state => {
    if (state.error) return state;

    let nextState = p1(state);
    if (!nextState.error) return nextState;

    nextState = p2(state);
    if (!nextState.error) return nextState;

    return {
        ...state,
        error: `error or`,
    };
};


const or = (...parsers) => state => {
    return parsers.reduce((p1, p2) => _or(p1, p2))(state);
};

const _and = (p1, p2) => state => {
    if (state.error) return state;

    let p1State = p1(state);
    if (p1State.error) return p1State;

    let p2State = p2(p1State);
    if (p2State.error) return p2State;

    return {
        ...p2State,
        value: [p1State.value, p2State.value]
    };
};

const and = (...parsers) => state => {
    if (state.error) return state;

    const parser = parsers.reduce((p1, p2) => {
        return _and(p1, p2);
    });
    const nextState = parser(state);
    if (nextState.error) return nextState;

    nextState.value = nextState.value.flat(Infinity);
    return nextState;
};

Other than these, we need some other more advanced parser combinators.

First one is many. Like before, it is like the * in regular expressions and can parse zero or multiple time by specified parser.

// like *
const many = parser => state => {
    if (state.error) return state;

    let value = [];
    let pre = state;
    while (true) {
        state = parser(pre);
        if (state.error) break;
        value.push(state.value);
        pre = state;
    }

    return {
        ...pre,
        error: null,
        value
    };
};

To make things easy, we need a many1 function. It is like the + in regular expressions and can parse one or multiple times by specified parser.

// like +
const many1 = parser => state => {
    if (state.error) return state;
    return and(parser, many(parser))(state);
};

And the last one, we call it or1 function. It is like the ? in regular expressions and can parse zero or 1 times by specified parser.

// like ?
const or1 = parser => state => {
    if (state.error) return state;
    const next = parser(state);
    if (next.error) {
        return state;
    } else {
        return next;
    }
};

Of course, we need to define some utils. A map function to convert parser result. A createInitState function to initialize state. And a lazy function, which is used to handle recursive parser function calls.

const map = (parser, fn) => state => {
    if (state.error) return state;

    state = parser(state);
    if (state.error !== null) return state;

    state.value = fn(state.value);
    return state;
};

function createInitState(data) {
    return {
        data: data,
        index: 0,
        error: null,
        value: [],
    };
}

const lazy = thunk => state => thunk()(state);

OK, with these basic tools, we can begin to parse JSON data.

First, let define a few special character in JSON.

const zero = char("0");
const one = char("1");
const two = char("2");
const three = char("3");
const four = char("4");
const five = char("5");
const six = char("6");
const seven = char("7");
const eight = char("8");
const nine = char("9");

const character = nchar('"');

const doubleQuote = char('"');
const comma = char(",");
const leftBracket = char("[");
const rightBracket = char("]");
const colon = char(":");
const leftCurlyBracket = char("{");
const rightCurlyBracket = char("}");

Then we start with the most simple stuff, booleans and null.

const jtrue = map(
    and(char("t"), char("r"), char("u"), char("e")),
    _ => ({ type: "scalar", value: true })
);

const jfalse = map(
    and(char("f"), char("a"), char("l"), char("s"), char("e")),
    _ => ({ type: "scalar", value: false })
);

const jnull = map(
    and(char("n"), char("u"), char("l"), char("l")),
    _ => ({ type: "scalar", value: null })
);

As you can see, we use map function to name all these tokens as type scalar, because these are not compound values like array or object. Same things go with the numbers and strings.

Now let see how to define a number parser.

const positiveDigit = or(one, two, three, four, five, six, seven, eight, nine);

const digit = or(positiveDigit, zero);

const jnumber = map(
    and(
        or1(char("-")),
        or(
            zero,
            and(
                positiveDigit,
                many(digit)
            )
        )
    ),
    value => ({ type: "scalar", value: Number(value.join("")) })
);

It may looks complicated at first, but it is just number. First, we use or1 to parse the negative character if there is one. Then if first character is zero, then we should step parsing. If not, then we should parse number as many as possible.

Then let's see how to define a string parser.

const characters = many(character);

const jstring = map(
    and(doubleQuote, characters, doubleQuote),
    value => ({ type: "scalar", value: value.slice(1, value.length - 1).join("") })
);

Remember in the beginning, we defined a parser character which used the nchar function to express any character other than double quote. It is defined in this way because in JSON, strings are using double quote as start and end mark. Then we can use and function to express a string, starts and ends with double quotes.

Next since whitespace is very free in JSON, we need to define a parser to handle whitespaces.

const jwhitespace = map(
    many(or(char(" "), char("\n"), char("\t"), char("\r"))),
    value => value.join("")
);

OK, now we have defined quite a few parser for JSON tokens. All these in JSON are called JSON values. Let's define a parser to combine these together.

const jvalue = or(
    jstring,
    jnumber,
    jtrue,
    jfalse,
    jnull
);

Based on this, let's think about how to handle arrays. The format of arrays in JSON is pretty free. But basically, it is surrounded by brackets and filled with values. Typical examples are like below.

[]                  // empty array, ok
[1]                 // array with only 1 element, ok
[1,2,3]             // array with multiple element, ok
[1,true,null]       // array with different types of elements, ok
[  1,    3 ]        // array filled with arbitrary number of whitespaces, ok

OK, let's start with simple ones. Since array elements can be surrounded by whitespaces, so we define a special parser for one element.

const jelement = map(
    and(
        jwhitespace,
        jvalue,
        jwhitespace,
    ),
    value => value[1]
);

Then we consider multiple elements. In JSON, the last element in array should not followed by a comma. So we use or to express this limit.

const jelements = or(
    and(jelement, comma, lazy(() => jelements)),
    jelement,
);

Note that here, since we don't know the number of elements in array, so we use a recursive approach to parse it.

Now we can define our array parser, which only contains two cases: empty array or not empty array. We use or function to express this format and use some map function to make the result clearly.

const jarray = map(
    or(
        map(
            and(leftBracket, jwhitespace, rightBracket),
            _ => ([])
        ),
        map(
            and(leftBracket, jelements, rightBracket),
            value => (value.slice(1, value.length - 1)).filter(v => v !== ",")
        )
    ),
    value => ({ type: "array", value })
);

Now we can make a test.

const state = createInitState("[1,2,3]");
const ast = jarray(state);
console.log(JSON.stringify(ast, null, 2));

/**

{
  "data": "[1,2,3]",
  "index": 7,
  "error": null,
  "value": {
    "type": "array",
    "value": [
      {
        "type": "scalar",
        "value": 1
      },
      {
        "type": "scalar",
        "value": 2
      },
      {
        "type": "scalar",
        "value": 3
      }
    ]
  }
}

 */

En, looks good. But wait, we still have one case hav't not considered: nested arrays.

[1,2,[3,[4],[]]]]

This is pretty complicated. Actually it is not. We have build a good structure, let's see how to handle it now. Remember we defined a parser jelement to express element in array. And we use jvalue to express jelement. So the possible values types of array is jvalue. We already filled jvalue with strings, numbers, booleans, etc. We can just add jarray parser into it and then we can parse arrays recursively.

const jvalue = or(
    lazy(() => jarray),
    jstring,
    jnumber,
    jtrue,
    jfalse,
    jnull
);

Now we can handle nested arrays.

const state = createInitState("[1,2,[3,[4],[]]]]");
const ast = jarray(state);
console.log(JSON.stringify(ast, null, 2));

/**

{
  "data": "[1,2,[3,[4],[]]]]",
  "index": 16,
  "error": null,
  "value": {
    "type": "array",
    "value": [
      {
        "type": "scalar",
        "value": 1
      },
      {
        "type": "scalar",
        "value": 2
      },
      {
        "type": "array",
        "value": [
          {
            "type": "scalar",
            "value": 3
          },
          {
            "type": "array",
            "value": [
              {
                "type": "scalar",
                "value": 4
              }
            ]
          },
          {
            "type": "array",
            "value": []
          }
        ]
      }
    ]
  }
}

 */

OK, that's good. Let's start to handle the last part of JSON: object.

Object is key value pairs. It can contains multiple entries and can be nested as well just like array.

Let's first define a member parser to express one entry in object.

const member = map(
    and(
        jwhitespace,
        jstring,
        jwhitespace,
        colon,
        jelement
    ),
    value => ({ "key": value[1], "value": value[4] })
);

As you can see, it can be surrounded by whitespaces as well. The key must be a string and value can be any types.

Next is members parser to express multiple extries in object.

const members = or(
    and(member, comma, lazy(() => members)),
    member
);

As you can see, it's structure is very similar to jelements for array.

Then we can define the parser to parse an object. Same thing with array, an object can have 2 cases: empty object and not empty object.

const jobject = map(
    or(
        map(
            and(leftCurlyBracket, jwhitespace, rightCurlyBracket),
            _ => ({})
        ),
        map(
            and(leftCurlyBracket, members, rightCurlyBracket),
            value => (value.slice(1, value.length - 1)).filter(v => v !== ",")
        )
    ),
    value => ({ type: "object", value })
);

And don't forget to add jobject into jvalue to make it handle nested objects.

const jvalue = or(
    lazy(() => jarray),
    lazy(() => jobject),
    jstring,
    jnumber,
    jtrue,
    jfalse,
    jnull
);

Finally, we can define out root JSON parser. A JSON can be one any scalar value or an array or an object. So it is just the jvalue parser. To make it handle whitespaces, we use jelement instead of jvalue directly.

const jsonParser = jelement;

Now we can make a test again.


const state = createInitState(`[1,2,true,{"hello": false, "world": [null]}]`);
const ast = jsonParser(state);
console.log(JSON.stringify(ast, null, 2));

/**
{
  "data": "[1,2,true,{\"hello\": false, \"world\": [null]}]",
  "index": 44,
  "error": null,
  "value": {
    "type": "array",
    "value": [
      {
        "type": "scalar",
        "value": 1
      },
      {
        "type": "scalar",
        "value": 2
      },
      {
        "type": "scalar",
        "value": true
      },
      {
        "type": "object",
        "value": [
          {
            "key": {
              "type": "scalar",
              "value": "hello"
            },
            "value": {
              "type": "scalar",
              "value": false
            }
          },
          {
            "key": {
              "type": "scalar",
              "value": "world"
            },
            "value": {
              "type": "array",
              "value": [
                {
                  "type": "scalar",
                  "value": null
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
 */

That's not all of it. We still need to evalute this ast to get the final result. Since we already have a well-structured ast, evaluation is very simple.

const evaluate = node => {
    switch (node.type) {
        case "scalar":
            return node.value;
        case "array":
            return node.value.map(value => evaluate(value));
        case "object":
            return Object.fromEntries(
                node.value.map(
                    value => ([evaluate(value.key), evaluate(value.value)])
                )
            );
        default:
            console.error("invalid type", node);
    }
};

OK, let's make a final test.

const result = evaluate(ast.value);
console.log(JSON.stringify(result, null, 2));

/**
[
  1,
  2,
  true,
  {
    "hello": false,
    "world": [
      null
    ]
  }
]

 */

It works just as expected.

Lastly, I have to say this is not all. Even if JSON is a very clear data format. There are still many edge cases we have not handled like fractions, exponents, special unicode, etc. So please use this code with caution. Good luck!