The LZW Data Compression Algorithm is a dictionary based algorithm based on The LZ78 Data Compression Algorithm. This algorithm is created by Welch and is based on LZ78, so it is called LZW algorithm.
The process is very similar to the LZW. When we finish this article, we can see what the improvement is.
In LZ78, if current substring is abc
and it is not in dictionary, then we will output a tuple with [dic["ab"], "c"]
. So the final output of LZ78 algorithm is a list of these tuples. But in LZW algorithm, we only output the dic["ab"]
, so the final output is just a list of number like [1,2,3...]
.
In LZ78, the dictionary is started as empty and built at the same time with the input data. In LZW, the dictionary is initialized with 0-255 ascii characters and later follow the similar approach with LZ78.
Now let take the same example string BABAABRRR
to see how it is encoded and decoded by LZW algorithm.
0 1 2 3 4 5 6 7 8
B A B A A B R R R
i => 0:
char: "B", pre: "", cur: "B"
cur in dic, pre = cur, move to next
i => 1:
char: "A", pre: "B", cur: "BA"
cur not in dic,
push pre to output: [dic["B"]]
push cur to dic: {..., "BA": 257}
update pre as char: pre: "A"
i => 2:
char: "B", pre: "A", cur: "AB"
cur not in dic,
push pre to output: [dic["B"], dic["A"]]
push cur to dic: {..., "BA": 257, "AB": 258}
update pre as char: pre: "B"
i => 3:
char: "A", pre: "B", cur: "BA"
cur in dic, pre = cur, move to next
...
Note that, after the iteration, the cur
substring with the last character may exists in dictionary as well, so don't forget to check this edge case.
Now let see this encoding process in code.
function encode(text: string): number[] {
const dic = new Map<string, number>();
for (let i = 0; i < 256; i++) {
dic.set(String.fromCharCode(i), i);
}
let pre = "";
const output: number[] = [];
for (const char of text) {
let cur = pre + char;
if (dic.has(cur)) {
// exist in dictionary, move to next iteration
pre = cur;
} else {
// not exit, push pre to output and dictionary, update pre as char
output.push(dic.get(pre) as number);
dic.set(cur, dic.size);
pre = char;
}
}
if (pre !== "") {
output.push(dic.get(pre) as number);
}
return output;
}
Make a test.
const text = "BABAABRRR";
const encoded = encode(text);
// [ 66, 65, 256, 257, 82, 260 ]
Now let's see the decoding process.
For decoding, we need to iterate the encoded list. For each iteration, we convert current code point into string and push it into output. To update the dictionary, we take the previous code point and the first character of current string, make it the next entry in the dictionary.
One edge case we need to handle. Sometimes we may encouter a code point which is not in current dictionary. For example, if the text is abababab
, then encoded result is [97, 98, 256, 258, 98]
. Then when decoding at the the code point 258, it is not in the dictionary yet. In this case, we just concat the previous string with the first character of the previous string and push it to the output.
Now let's see this process in code.
function decode(encoded: number[]): string {
const dic = new Map<number, string>();
for (let i = 0; i < 256; i++) {
dic.set(i, String.fromCharCode(i));
}
// first value as start
let pre = dic.get(encoded[0]) as string;
let output = pre;
for (let i = 1; i < encoded.length; i++) {
const codePoint = encoded[i]
let cur: string;
if (dic.has(codePoint)) {
// if codePoint exists in dictionary, then use it
cur = dic.get(codePoint) as string;
} else {
// if not, use pre + pre[0];
cur = pre + pre.charAt(0);
}
// push it to output
output += cur;
// new dict entry: pre + cur[0]
dic.set(dic.size, pre + cur.charAt(0));
pre = cur;
}
return output;
}
Make a test.
const text = "BABAABRRR";
const encoded = encode(text);
const decoded = decode(encoded);
console.log({ text, encoded, decoded });
// { text: "BABAABRRR", encoded: [ 66, 65, 256, 257, 82, 260 ], decoded: "BABAABRRR" }