The LZ77 Data Compression Algorithm

The LZ77 Data Compression Algorithm

·

4 min read

LZ77 is a lossless data compression algorithm published by Abraham Lempel and Jacob Ziv in 1977. This algorithm form the basis for many variations compression algorithms like LZW, LZSS, etc, which are widely used in many areas including GIF, PNG, ZIP, etc.

The key idea of LZ77 is to find patterns and reduce redundancies. For example, if we already seen vocabulary compression before, when we see this vocabulary next time, we reduce this repetition and only store reference for it.

Now let's see how this reference works. Say we want to compress a string abc-abc+abc with LZ77. And we are at position of second abc.

****-------
43210123456
abc-abc+abc

From above diagram, we can see that the string has 2 parts. The first part is with the *, stands for the part we already seen. The second part is with the -, stands for the part we are plan to handle.

Now what we want to do is to start from index 0, try to match the string we already seen. We can tell that the longest match is index [0,2], matching with index [4,2], with string value abc. After calculating this information, then we store it into a length-distance pair. We can think of this length-distance pair as an array with 3 element. The first element stores the starting index of the matched string. In this example, it is 4. The second element stores the matched length. In this example it is 3. The third element stores the next character after matching. In this example, it is character +. So with this one calculation, we get an array [4, 3, "+"].

So, as you can imagine, we start from the start of the data, for every step, we store a length-distance pair. When the iteration finishes, then we get an array of pairs, which is the encoded data.

Another detail I have not cover is that, there are two buffers to limit the matching process. From above example, we try to match the - part with the * part. The first buffer is called search buffer, stands for the length of the * part. The second one is called look-ahead buffer, stands for the length of the - part. To make the matching execute faster, so in LZ77, these 2 parts are limited.

Now let's take another example to see the whole process.

data: aacaacabcababac

*: search buffer
-: look-ahead buffer

            * * * * * * - - -
            6 5 4 3 2 1 0 1 2
                        a a c aacabcababac [0,0,a]
                      a a c a acabcababac  [1,1,c]
                  a a c a a c abcababac    [3,3,a]
          a a c a a c a b c a babac        [0,0,b]
         aa c a a c a b c a b abac         [3,3,a]
     aacaac a b c a b a b a c              [2,2,c]

As you can see, if there is no match, we just store 1 character with a pair. If there is a match, we can store much more information with one pair item.

OK, with all this in mind, let try to write code. First is the encoding part.


/**
 * 
 * @param data string
 * @param sbl search buffer length
 * @param lbl look-ahead buffer length
 * @returns encoded items
 */
function encode(data: string, sbl: number, lbl: number): Item[] {
    const output: Item[] = [];

    let pos = 0;

    while (pos < data.length) {

        // below double for loop for finding longest match

        let offset = 0;
        let length = 0;
        for (let i = pos - 1; i >= pos - sbl && i >= 0; i--) {

            let j = i;
            for (j = i; j < pos; j++) {
                if (j - i + 1 > lbl) break;
                if (data[j] !== data[pos + j - i]) break;
            }

            if (j - i > length) {
                length = j - i;
                offset = pos - i;
            }
        }

        const c = pos + length < data.length ? data[pos + length] : "";
        output.push([offset, length, c]);
        pos += length + 1;
    }

    return output;
}

Next is the decoding part. Decoding is a lot easier. We only need to iterate the encoded data, and try to expand the result.

function decode(encoded: Item[]): string {

    let output = "";
    for (const [offset, length, c] of encoded) {
        const start = output.length - offset;
        const end = start + length;
        output += output.slice(start, end);
        output += c;
    }

    return output;
}

Now let's make a test.

const data = "aacaacabcababac";
const encoded = encode(data, 6, 3);
const decoded = decode(encoded);
console.log({ data, encoded, decoded }, data === decoded);

/**
{
  data: "aacaacabcababac",
  encoded: [
    [ 0, 0, "a" ],
    [ 1, 1, "c" ],
    [ 3, 3, "a" ],
    [ 0, 0, "b" ],
    [ 3, 3, "a" ],
    [ 2, 2, "c" ]
  ],
  decoded: "aacaacabcababac"
} true
*/