Rabin Karp Algorithm

Rabin Karp Algorithm

·

6 min read

Rabin Karp Algorithm is a string matching algorithm, which uses the rolling hash idea to help comparing two strings. Let's see how the algorithm works.

Let's use the leetcode problem Implement strStr() as a test case. What we need to do is to find the needle string in haystack string.

Say the haystack is abcdefg and the needle is cde.

Let's first see the brute force way to do this. To brute force it, we need to iterate the haystack string and compare part of it with needle every time. Its process like below.

haystack: abcdefg
needle: cde

abcdefg
abc         => cde
 bcd        => cde
  cde       => cde => match return
   def      => cde
    efg     => cde

As you can see, if the haystack's length is m, and the needle's length is n, then time complexity should be O(m*n).

In the brute force approach, in every iteration, we need to iterate the needle string again to compare it with part of haystack string. The idea of Rabin Karp Algorithm is to make this part more efficient by using the idea of hash.

Imagine we don't need to compare the needle with the part of the haystack every time, we just compare the hash of them, then the process should be way faster.

hash(needle) === hash(haystack.slice(0, needle.length))

Only hash them is not enough, to calculate the hash, we still need to loop the length of needle every time. The time complexity is still the same.

haystack: abcdefg

hash(abc)
hash(bcd)
hash(cde)
hash(def)
hash(efg)

So the next idea is to using rolling hash. The idea is if we could design a special hash function, which could use the previous hash to calculate current hash, then we can reduce the hash calculation time complexity from O(n) to O(1).

To show the idea of rolling hash, let use a very simple rolling hash function first.

First, we map every character to a number, say a is 1, b is 2 and so on. Then when just make the sum of values as the hash values.

1 2 3 4 5 6 7 ... 26
a b c d e f g ... z

hash(abc) => a+b+c => 1+2+3
hash(bcd) => b+c+d => 2+3+4
...

This simple hash function has a special feature: we can calcuate the next hash easily based the previous one.

hash(abc) => a+b+c => 1+2+3
hash(bcd) => b+c+d => 2+3+4

hash(bcd) = hash(abc) - a + d

So now the time complexity of comparison part becomes O(1). Does it means that the whole time complexity becomes O(m)? Actually, the answer is maybe. As you may know about the hash function, there is a concept of hash collision. Hash collision means that there are some cases that even the hashes are equal, the actual two strings may not be equal. Take the above simple sum hash function as an example, hash(abc) should have the same hash value with hash(bca). So in this case, hash collision happens. When it happens, we need to compare these two strings one character by one character again.

So in best case, the time complexity is O(m) and in worse case, the time complexity becomes O(m*n) again.

Before we move on, let's see the code using this simple sum hash function to solve the problem.

function strStr(haystack: string, needle: string): number {
    if (needle.length > haystack.length) return -1;

    const base = "a".charCodeAt(0) - 1;

    let needleHash: number = 0;
    for (let i = 0; i < needle.length; i++) {
        needleHash += (needle.charCodeAt(i) - base);
    }

    let haystackHash: number = 0;
    for (let i = 0; i < haystack.length; i++) {
        haystackHash += (haystack.charCodeAt(i) - base);
        if (i >= needle.length) {
            haystackHash -= (haystack.charCodeAt(i - needle.length) - base);
        }

        // if hash equals, check every character again
        if (needleHash === haystackHash) {
            let isEqual = true;
            let start = i - needle.length + 1;
            for (let j = 0; j < needle.length; j++) {
                if (haystack.charCodeAt(start + j) !== needle.charCodeAt(j)) {
                    isEqual = false;
                    break;
                }
            }
            if (isEqual) return start;
        }
    }

    return -1;
};

Now you may thinking, if we could design a more sophisticated hash function, which reduce hash collision, then we can make this algorithm's time complexity close to O(m).

One popular working hash function is the polynomial rolling hash. The idea is, not just sum all values, but make every character has itw one weight. Let's see a example below.

hash(abc) => a*26^2 + b*26^1 + c*26^0

As you can see, because we have 26 unique characters(a-z), so make the weight 26. Then make the every character have different weights according to its order. In this way, the hash collision should be reduced substantially.

Next thing we need to know how to do rolling hash using this hash function. Still, let's see an example.

hash(abc) => a*26^2 + b*26^1 + c*26^0
hash(bcd) =>          b*26^2 + c*26^1 + d*26^0

So
hash(bcd) = (hash(abc) - a*26^2) * 26 + d*26^0

As you can see, we can do rolling hash easily with this hash function too. Let's see this in code.

function strStr(haystack: string, needle: string): number {
    if (needle.length > haystack.length) return -1;

    const base = "a".charCodeAt(0) - 1;

    let needleHash: number = 0;
    for (let i = 0; i < needle.length; i++) {
        needleHash *= 26;
        needleHash += needle.charCodeAt(i) - base;
    }

    let haystackHash: number = 0;
    for (let i = 0; i < haystack.length; i++) {
        if (i >= needle.length) {
            haystackHash -= (haystack.charCodeAt(i - needle.length) - base) * Math.pow(26, needle.length - 1);
        }
        haystackHash *= 26;
        haystackHash += (haystack.charCodeAt(i) - base);

        if (haystackHash === needleHash) return i - needle.length + 1;
    }

    return -1;
};

However, there is still one problem left. As you can see, there is a lot multiplication above. This may cause number overflow. So normally, we should apply modulo operation to every step to avoid overflow. Applying modulo operation with rolling hash requires some mathematical derivation. Refer to my previous article Modulo Operation, if it helps.

function strStr(haystack: string, needle: string): number {
    if (needle.length > haystack.length) return -1;

    const base = "a".charCodeAt(0) - 1;

    // const mod = Math.pow(10, 9) + 7;  // big mod, less collision
    const mod = 113;

    let needleHash: number = 0;
    for (let i = 0; i < needle.length; i++) {
        // apply mod for every step
        needleHash = ((needleHash * 26) % mod + (needle.charCodeAt(i) - base)) % mod;
    }

    let haystackHash: number = 0;
    for (let i = 0; i < haystack.length; i++) {

        // subtract leading character
        if (i >= needle.length) {
            // apply mod for every step
            haystackHash -= (((haystack.charCodeAt(i - needle.length) - base) * Math.pow(26, needle.length - 1)) % mod);
        }

        // apply mod for every step
        haystackHash = ((haystackHash * 26) % mod + (haystack.charCodeAt(i) - base)) % mod

        // here, hash may become negative, make it positive
        if (haystackHash < 0) {
            haystackHash += mod;
        }

        // hash equals, compare character one by one in case of hash collision
        if (haystackHash === needleHash) {
            let isEqual = true;
            let start = i - needle.length + 1;
            for (let j = 0; j < needle.length; j++) {
                if (haystack.charCodeAt(start + j) !== needle.charCodeAt(j)) {
                    isEqual = false;
                    break;
                }
            }
            if (isEqual) return start;

        }
    }

    return -1;
};