Knuth–Morris–Pratt(KMP) Algorithm

Knuth–Morris–Pratt(KMP) Algorithm

·

4 min read

Knuth–Morris–Pratt(KMP) algorithm is a very effective string matching algorithm which has a time complexity of O(n).

Before we get to details of this algorithm, let's refresh our memory to see how the brute force way works.

Say we have two string haystack and needle, we want to find needle in haystack.

haystack: aaaxaaaa
needle:   aaaa

The brute force way is to iterate the haystack and compare each character in each iteration.

aaaxaaaa
aaaa
 aaaa
  aaaa
   aaaa
    aaaa

So the time complexity is O(mn), for m is the length of needle, and n is the length of haystack.

Knuth–Morris–Pratt algorithm is an approach based on this brute force way. Now let's see what the improvement is.

In above example, when we find one unmatch character, we move one character to the right, and start matching from the beginning again.

   0 1 2 3 4 5 6 7
   a a a x a a a a
1. a a a a
2.   a a a a

But when we move to character x and finds an unmatch, we already have some valuable information. That information is, we know that first 3 character aaa match with the first 3 character of needle. We could somehow make use of this information to speed up the process.

Let's use i to express current start indice of haystack and j to express current start indice of needle. We know now that if i is 0, then we will bump into an unmatch, so we need to move i to the right by 1, and compare aaxa with needle again.

  i
0 1 2 3 4 5 6 7
a a a x a a a a

j   
0 1 2 3
a a a a

Forget about x for the time being, just consider the first 3 characters first. The first time, we already compare first 3 characters and there is a match. And the second time, we need to compare the haystack.slice(1,3) with needle.slice(0,2). Do you see the pattern here? haystack.slice(1,3) can be seen as needle.slice(1,3) because of the previous match. So basically, we are compare the needle's prefix (0,2) with its suffix (1,3) now.

   0 1 2
   a a a    x a a a a a
1. a a a    a
2.   a a    a

If a string's prefix is equal to its suffix, then we call this happy prefix. You can checkout to my previous article Longest Happy Prefix to see how to calculate happy prefix.

So if we can somehow know this is the needle's happy prefix, then we don't need to compare this 2 characters again. We can just move j to needle's longest happy prefix and then compare j with i directly.

      i
0 1 2 3 4 5 6 7
a a a x a a a a

    j   
0 1 2 3
a a a a

i don't need to be moved to index 1
j can be moved to longest happy prefix, index 2

If the longest happy prefix doesn't match again, follow the same process, move j to second longest happy prefix index, until j is 0.

      i
0 1 2 3 4 5 6 7
a a a x a a a a

  j   
0 1 2 3
a a a a

If j is 0, and there is still no match, then move i directly to the next indice, we don't need to move i to 1 ever.

        i
0 1 2 3 4 5 6 7
a a a x a a a a

j   
0 1 2 3
a a a a

In this way, the indice i will never come back, we just move j back and forth.

This is all the details. Now let's see the code.

function strStr(haystack: string, needle: string): number {
    if (needle === "") return 0;

    // calculate happy prefix for needle
    function getlps(s: string) {
        let i = 1;
        let j = 0;
        let lps = Array.from({ length: s.length }, _ => 0);
        while (i < s.length) {
            if (s.charAt(i) === s.charAt(j)) {
                lps[i++] = ++j;
            } else {
                if (j === 0) {
                    lps[i++] = 0;
                } else {
                    j = lps[j - 1];
                }
            }
        }
        return lps;
    }
    const lps = getlps(needle);

    let i = 0;
    let j = 0;
    while(i < haystack.length) {
        if (haystack.charAt(i) === needle.charAt(j)) {
            // if current character match, move forward
            i += 1;
            j += 1;
        } else {
            if (j === 0) {
                // if j is 0, move i forward directly
                i += 1;
            } else {
                // or move j back to the longest happy prefix index for already matched part
                j = lps[j-1]
            }
        }

        if (j === needle.length) {
            return i - needle.length;
        }
    }

    return -1;
};

Don't forget to use leetcode Implement strStr() problem to test if the code is right.