Tip:
Highlight text to annotate it
X
In this tutorial we want to present the strigns match problem,
using the KMP algorithm.
We have two strings, X and Y,
X must be shorter than Y.
We define N as the length of X and M as the length of Y.
We have to find the number of the aparitions of X in Y.
A naive algorithm has a complexity of N*M
This algorithm verifies for every position in X
if it apears in Y.
We need to find a more optimised algorithm,
with a complexity of N+M.
How can we achieve this complexity?
We will define Xi as the longest prefix of X
which ends on position i.
KMP uses an auxiliar function, Pi[i].
Pi[i] is defined as the longest prefix of Xi,
which is also a suffix of Xi.
It is very important to remeber that
Pi[i] MUST be smaller than i.
In other words, we will not consider Xi as being its own prefix.
Respecting the laws we defined Pi by, Pi[1]=0, because Pi[i] must be smaller than i.
Now, for X2, "B" does is not equal with "A", so Pi[2]=0.
We take a look at X3. "A" is similar to "A" so Pi[3] is 1, for the moment.
"BA" does not coincide with "AB", so we have to stop
and on the third position in Pi we will have 1.
Similarly, "C" does not coincide with "A", so Pi[4]=0.
At first sight, the algorithm seems to have a complexity of N^2, but,
due to the way Pi[i] is constructed, we can achieve an algorithm with
a complexity of O(N).
For a better understanding of this algorithm, we will take another example,
a little bit longer.
As we said at the previous example, Pi[1] will be 0.
Now, we already know that by position 1, the string coincides with 0 positions, as a prefix.
That's why we will try to add one more letter, X[i]. This letter will be compared with the one
on position Pi[i-1]+1, but, as we can see in the example, it is also 0.
Similarly, until position 2, we know that Pi[2]=0 and we try to add a new letter, X[3].
X[3] is equal to x[Pi[2]+1], so string's length is increased by 1.
Following the example, "C" does not coincide, so the longest prefix is 0 and we try to add a new letter.
"A" is similar to "A", so Pi[5] is 1.
By this moment, the length of the prefix is 1, so we know that "A" coincides with "A".
We try to add a new letter. "B"="B", so Pi[6] is 2.
"A" coincides with "A", so the length is increased by 1.
Now, we notice that "C" is not similar to "B".
The next position that we have to verify is Pi[3], or 1, to be precise.
We are certain that X[1] coincides with X[7].
"B" from position 2 coincides with the "B" from position 8, so Pi[8] is 2.
"A" is similar to "A", so the length increases and is equal to 3.
"C" coincides with "C", so the length is 4.
Now let's see the algotihm...
N is the length of X. We will keep an iterator, K,
wich represents the length of the prefix until i.
As I mentioned in the examples before, Pi[1] is 0.
We will try, for every position from i 2 to N, to find Pi[i].
We know that, by the previous step, the length is K.
And now, while k is greater than 0 and the the character is different from the one we are trying to add,
K will become Pi[K], due to the way that Pi is developed by.
When we finish the "while" instrucion, we have to verify if the new character matches the one that has already been added.
If X[K+1] is equal to X[i], we increase K. Practically, we increase the length of the prefix.
In the end, Pi[i]=K.
Now let's take a look over the complexity of the algorithm...
i increases from 2 to N, so for now the complexity is O(N).
We can notice that K will be smaller and smaller, until it reaches the value 0.
When K is 0, we can not continue the "while" command.
So, during the algorithm, it will be modified N times, at maximum.
In conclusion, although this while seems to have an O(N) complexity, the total complexity is O(N).
At this step we increase the iterator K. It will be increased for maximum N times.
Considering these facts, the complexity of this part of the algorithm is O(N).
Now, in order to completely solve the problem, we will use Pi, that we constructed previuosly, and,
in a similar way, we will construct d[i].
d[i] represents the longer prefix of Xi, which is a suffix for Yi.
We have to mention that, in this case, d[i] doesn't need to be smaller than i.
We take a look over the first position. "A" coincides with "A", so d[1] is 1.
Similarly, "B" coincides with "B", "A" coincides with "A" and "C" coincides with "C".
"A" coincides with the first "A", so d[5]=1.
d[6] becomes 2, d[7] becomes 3 and d[8] becomes 2.
We continue... d[9] becomes 3 and d[10] becomes 4.
The number of matches is represented by the number of apearances of X's length (N) in d.
Similar to the construction of Pi, we will construct d, using Pi.
"A" coincides with "A" and "B" coincides with "B", so d[1]=1 and d[2]=2.
We will take a look to the following positions... "A" coincides with "A" so d[3]=3.
"C" coincides with "C", so d[4]=4.
Now, for "A" we do not have any other letter to compare with. That's why we will go to position Pi[4]+1,
which equals 1. "A" coincides with "A" so d[5]=1.
"B" coincides with "B" and "A" coincides with "A", so d[6]=2 and d[7]=3.
"B" doesn't coincide with "C" so we go to position Pi[3]+1. In this case "B" coincides with "B" so d[8]=2.
d[9]=3 and d[10]=4 because "A" coincides with "A" and "C" coincides with "C".
The construction of d will be realised almost identically with the construction of Pi.
More precise, M is the length of y. We construct Pi, which will help us constructing d.
K is an iterator that is initially 0.
We will take i from 1 to M.
As long as K>0 (the prefix isn't 0) and the new letter that we want to add is different from
the next letter of the prefix, K takes the value of Pi[K].
We verify if the two letters are equal and, if so, we increase K by 1.
Finally, d[i] takes the value of K.
The complexity, as we prooved at the previous step, is O(M).
The total complexity of the algorithm is O(N+M).
For more tutorials access:
http/sites.google.com/site/centrulinfo1