The LZ78 Data Compression Algorithm

We talked about LZ77 compression algorithm before. In this article, let's talk about LZ78 compression algorithm.

Just like LZ77, LZ78 is also created by Abraham Lempel and Jacob Ziv. They published this algorithm in 1978, so its named LZ78.

The key idea of LZ78 is very similar to LZ77, they all try to find if we already saw current data before, if it is, then we replace current data with a reference to the previous data.

Remember in LZ77, when we have current data and try to know if we have saw this before, we use search to do it. We search the data in look-ahead buffer from the search buffer. Well in LZ78, we use a dictionary. Whenever we see data we have not seen before, we store it in a dictionary. So later we can check this dictionary easily.

Now let's take a example to see how it goes.

Say we have a string BABAABRRR. We iterate this string character by character.

So first we get character B. Now the dictionary is empty, so we store this character into dictionary with value 1: dic.set("B", 1). And we output an array with 2 values: [0, B]. The first value 0 stands for no reference, B stands for the end character of current data.

Then we move to second character A. There is no A in dictionary, so the same step as before: store A in dictionary: dic.set("A", 2), and output [0, "A"].

Then we move to third character B. So there is B in dictionary. So we do nothing, move to next character A. So now we have BA, is BA in the dictionary? No. OK, so follow the same process, store BA in dictionary: dic.set("BA", 3), and this time, we output [1, "A"]. This 1 is the value of dic.get("B"), stands for the reference to the previous data, and A is current string's last character, stands for new data.

So now you know this basic 2 cases. Let's see all steps in below.

0 1 2 3 4 5 6 7 8
B A B A A B R R R

i => 0, cur => B,  dic => {"B", 1},  output: [0, "B"] 
i => 1, cur => A,  dic => {"A", 2},  output: [0, "A"] 
i => 2, cur => B
i => 3, cur => BA, dic => {"BA": 3}, output: [1, "A"]
i => 4, cur => A,
i => 5, cur => AB, dic => {"AB": 4}, output: [2, "B"]
i => 6, cur => R,  dic => {"R": 5},  output: [0, "R"]
i => 7, cur => R,
i => 8, cur => RR, dic => {"RR": 6}, output: [5, "R"]

OK, now we know all the process, let see how to do encoding in code.

type Item = [number, string];

function encode(data: string): Item[] {
    const dic = new Map<string, number>();
    dic.set("", 0);

    let encoded: Item[] = [];

    let cur = "";

    for (let i = 0; i < data.length; i++) {
        cur += data[i];

        if (dic.has(cur)) {
            continue;
        };

        dic.set(cur, dic.size);

        const item: Item = [
            // reference previous
            dic.get(cur.slice(0, cur.length - 1)) as number,
            // output last character
            cur[cur.length - 1]
        ]
        encoded.push(item)

        cur = "";
    }

    // handle this edge case
    if (cur.length > 0) {
        encoded.push([dic.get(cur) as number, ""])
    }

    return encoded;
}

Decoding is a lot easier. The key is follow the same process in reverse, building the dictionary and decoding at the same time.

function decode(encoded: Item[]): string {

    let decoded = "";
    const dic = new Map<number, string>();
    dic.set(0, "");

    for (const item of encoded) {
        const pre = dic.get(item[0]) as string;
        decoded += pre;
        decoded += item[1];

        dic.set(dic.size, pre + item[1]);
    }

    return decoded;
}

You can see how clever this algorithm is. We don't need to store the dictionary, the dictionary is built along with decoding process.

And lastly, we make some tests.

function test(data: string) {
    const encoded = encode(data);
    const decoded = decode(encoded);
    console.log({ data, encoded, decoded }, decoded === data)
}

test("ABBCBCABABCAABCAAB")
test("BABAABRRR")
test("AAAAAAAAA")

/**

{
  data: "ABBCBCABABCAABCAAB",
  encoded: [
    [ 0, "A" ],
    [ 0, "B" ],
    [ 2, "C" ],
    [ 3, "A" ],
    [ 2, "A" ],
    [ 4, "A" ],
    [ 6, "B" ]
  ],
  decoded: "ABBCBCABABCAABCAAB"
} true
{
  data: "BABAABRRRA",
  encoded: [
    [ 0, "B" ],
    [ 0, "A" ],
    [ 1, "A" ],
    [ 2, "B" ],
    [ 0, "R" ],
    [ 5, "R" ],
    [ 2, "" ]
  ],
  decoded: "BABAABRRRA"
} true
{
  data: "AAAAAAAAA",
  encoded: [ [ 0, "A" ], [ 1, "A" ], [ 2, "A" ], [ 3, "" ] ],
  decoded: "AAAAAAAAA"
} true

*/

yaox023's blog

yaox023's blog

The LZ78 Data Compression Algorithm