The KMP String Matching Algorithm

Reposted

Java海洋 2022-09-23 20:44:38 Category: Data Structures and Algorithms

Preface

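Earlier posts covered three string matching algorithms: the naive algorithm, the Rabin-Karp algorithm, and the finite-automaton algorithm. This post covers the Knuth-Morris-Pratt string matching algorithm (KMP for short). The core of KMP is building the array next, which stores, for each prefix of the pattern pat, the length of the longest proper prefix that is also a suffix. When the naive algorithm hits a mismatch, it slides the pattern one position to the right and starts comparing again. KMP instead uses the number of characters already matched, together with the entries of next, to decide how far to slide. As a result KMP runs in linear time: preprocessing takes O(m) and matching takes O(n), where m is the pattern length and n is the text length, compared with O(mn) in the worst case for the naive algorithm.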

Java's indexOf() method uses the brute-force approach, though with some optimizations.

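Implementing KMP string matching

KMP preprocessing

First, the basic notions of prefix and suffix:

Prefix: a substring that starts at the first character of a string and leaves out at least the last character is a prefix of that string;

Suffix: a substring that ends at the last character of a string and leaves out at least the first character is a suffix of that string;

Note: the empty string is both a prefix and a suffix of every string;

For example, the prefixes of the string "Pattern" are: "P", "Pa", "Pat", "Patt", "Patte", "Patter";

and its suffixes are: "attern", "ttern", "tern", "ern", "rn", "n";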
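The two definitions can be checked mechanically. The sketch below is only an illustration: the function names properPrefixes and properSuffixes are made up for this example and do not appear in the original post.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: all non-empty proper prefixes of s, shortest first.
std::vector<std::string> properPrefixes(const std::string &s) {
    std::vector<std::string> out;
    for (std::size_t len = 1; len < s.size(); ++len)
        out.push_back(s.substr(0, len));  // drops at least the last character
    return out;
}

// Hypothetical helper: all non-empty proper suffixes of s, longest first.
std::vector<std::string> properSuffixes(const std::string &s) {
    std::vector<std::string> out;
    for (std::size_t start = 1; start < s.size(); ++start)
        out.push_back(s.substr(start));   // drops at least the first character
    return out;
}
```

For "Pattern" both functions return six strings, in the same order as the lists above.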

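Before matching, KMP first computes the array next for the pattern. Take the pattern pat = abababca as an example. In the table below, value holds the entries of next and index is the array subscript. Note that next[i] is the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i], so the substring includes the character at position i. That is because the KMP described here is the basic, unoptimized form; optimizing it means optimizing the next array, which is covered later in this post.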


char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

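- For "a", the sets of prefixes and suffixes are both empty, so the longest common string has length 0;

- For "ab", the prefixes are [a] and the suffixes are [b]; they have no string in common, so the length is 0;

- For "aba", the prefixes are [a, ab] and the suffixes are [ba, a]; the longest common string is [a], of length 1;

- For "abab", the prefixes are [a, ab, aba] and the suffixes are [bab, ab, b]; the longest common string is [ab], of length 2;

- For "ababa", the prefixes are [a, ab, aba, abab] and the suffixes are [baba, aba, ba, a]; the longest common string is [aba], of length 3;

- For "ababab", the prefixes are [a, ab, aba, abab, ababa] and the suffixes are [babab, abab, bab, ab, b]; the longest common string is [abab], of length 4;

- For "abababc", the prefixes are [a, ab, aba, abab, ababa, ababab] and the suffixes are [bababc, ababc, babc, abc, bc, c]; they have no string in common, so the length is 0;

- For "abababca", the prefixes are [a, ab, aba, abab, ababa, ababab, abababc] and the suffixes are [bababca, ababca, babca, abca, bca, ca, a]; the longest common string is [a], of length 1.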
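The table can be reproduced directly from the definition, without any cleverness. The following is a rough sketch for checking small examples by hand; bruteForceNext is a name invented here, and the approach is deliberately slow (cubic in the worst case), unlike the linear-time recurrence that KMP actually uses.

```cpp
#include <string>
#include <vector>

// Hypothetical checker: computes next[] straight from the definition.
// next[i] = length of the longest proper prefix of pat[0..i] that is also
// a suffix of pat[0..i].
std::vector<int> bruteForceNext(const std::string &pat) {
    std::vector<int> next(pat.size(), 0);
    for (std::size_t i = 0; i < pat.size(); ++i) {
        // Try the longest candidate length first; k == i + 1 is excluded
        // because the prefix must be proper.
        for (std::size_t k = i; k >= 1; --k) {
            // Compare prefix pat[0..k-1] with suffix pat[i-k+1..i].
            if (pat.compare(0, k, pat, i + 1 - k, k) == 0) {
                next[i] = static_cast<int>(k);
                break;
            }
        }
    }
    return next;
}
```

For "abababca" this yields 0, 0, 1, 2, 3, 4, 0, 1, matching the value row above.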

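Computing the next array by recurrence

Given next[0..i-1], compute next[i] as follows, where len starts out as next[i-1], the length of the previous longest prefix-suffix:

If P[i] = P[len], then next[i] = ++len, and i++ moves on to the next character;
If P[i] != P[len], there are two cases:

If len != 0, fall back: set len = next[len-1] and leave i unchanged, so that the next comparison is between P[i] and P[next[len-1]];
If len = 0, the entry for the current character is 0, that is, next[i] = 0; len stays the same and i++ moves on to the next character.

The code below computes the next array of a pattern: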


#include <string>
using namespace std;

// Fills next[0..M-1]; the caller must provide room for M ints.
void computeNextArray(const string &pat, int M, int *next)
{
    if (M <= 0)
        return;
    int len = 0; // length of the previous longest prefix suffix
    int i = 1;
    next[0] = 0; // next[0] is always 0
    // the loop calculates next[i] for i = 1 to M-1
    while (i < M)
    {
        if (pat[i] == pat[len])
        {
            len++;
            next[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            if (len != 0)
            {
                // This is tricky. Consider the example AAACAAAA and i = 7.
                len = next[len - 1];
                // Also, note that we do not increment i here
            }
            else // if (len == 0)
            {
                next[i] = 0;
                i++;
            }
        }
    }
}
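The code comment points at AAACAAAA with i = 7 as the tricky case. To see what happens there, the sketch below repeats the same recurrence but records every fallback step. The names Fallback and computeNextTraced are hypothetical, and std::vector replaces the raw array purely for convenience.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one fallback step: at index i, len dropped
// from fromLen to toLen = next[fromLen - 1].
struct Fallback {
    int i;
    int fromLen;
    int toLen;
};

// Same recurrence as computeNextArray, with each fallback appended to log.
std::vector<int> computeNextTraced(const std::string &pat,
                                   std::vector<Fallback> &log) {
    const int m = static_cast<int>(pat.size());
    std::vector<int> next(m, 0);
    int len = 0;
    int i = 1;
    while (i < m) {
        if (pat[i] == pat[len]) {
            ++len;
            next[i] = len;
            ++i;
        } else if (len != 0) {
            log.push_back({i, len, next[len - 1]});
            len = next[len - 1];  // i is deliberately not advanced
        } else {
            next[i] = 0;
            ++i;
        }
    }
    return next;
}
```

On "AAACAAAA" the result is 0, 1, 2, 0, 1, 2, 3, 3. At i = 7 the character A is compared with pat[3] = C and fails, len falls back from 3 to next[2] = 2, and the retry against pat[2] = A succeeds, which gives next[7] = 3 rather than 0.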

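The KMP matching process

If the current characters match, that is, pat[j] = txt[i], then i++ and j++, and matching continues with the next character;
If the current characters do not match, that is, pat[j] != txt[i], there are two cases:

If the pattern position j != 0, the pattern slides right by j - next[j-1] positions relative to the text; the text position i stays the same, the pattern position becomes j = next[j-1], and matching continues;
If the pattern position j = 0, only the text position advances (i++); nothing else changes, and matching continues with the next character.

The source code is as follows: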


#include <iostream>
#include <string>
#include <stdlib.h>
using namespace std;

void computeNextArray(const string &pat, int M, int *next);

void KMPSearch(const string &pat, const string &txt)
{
    int M = pat.length();
    int N = txt.length();
    if (M == 0 || M > N)
        return; // nothing to search for, or the pattern cannot fit
    // create next[] that will hold the longest prefix suffix values for pattern
    int *next = (int *)malloc(sizeof(int) * M);
    if (next == NULL)
        return;
    int j = 0; // index for pat[]
    // Preprocess the pattern (calculate next[] array)
    computeNextArray(pat, M, next);
    int i = 0; // index for txt[]
    while (i < N)
    {
        if (pat[j] == txt[i])
        {
            j++;
            i++;
        }
        if (j == M)
        {
            cout << "Found pattern at index:" << i - j << endl;
            j = next[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i])
        {
            // Do not re-compare pat[0..next[j-1]-1],
            // they will match anyway
            if (j != 0)
                j = next[j - 1];
            else
                i = i + 1;
        }
    }
    free(next); // to avoid memory leak
}
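Printing from inside the search loop makes the result hard to reuse or test. One possible variant, sketched below under the assumption that returning the match positions is acceptable, collects them in a vector instead. The names buildNext and kmpFindAll are invented for this example; the matching logic is the same as in KMPSearch.

```cpp
#include <string>
#include <vector>

// Unoptimized next array: longest proper prefix-suffix length of pat[0..i].
std::vector<int> buildNext(const std::string &pat) {
    std::vector<int> next(pat.size(), 0);
    int len = 0;
    std::size_t i = 1;
    while (i < pat.size()) {
        if (pat[i] == pat[len]) {
            ++len;
            next[i] = len;
            ++i;
        } else if (len != 0) {
            len = next[len - 1];
        } else {
            next[i] = 0;
            ++i;
        }
    }
    return next;
}

// Hypothetical variant of KMPSearch: returns the start index of every
// occurrence of pat in txt (overlapping occurrences included).
std::vector<int> kmpFindAll(const std::string &txt, const std::string &pat) {
    std::vector<int> hits;
    if (pat.empty() || pat.size() > txt.size())
        return hits;
    const std::vector<int> next = buildNext(pat);
    const int n = static_cast<int>(txt.size());
    const int m = static_cast<int>(pat.size());
    int i = 0;  // index into txt
    int j = 0;  // index into pat
    while (i < n) {
        if (txt[i] == pat[j]) {
            ++i;
            ++j;
            if (j == m) {             // full match ending at i - 1
                hits.push_back(i - j);
                j = next[j - 1];      // keep going to find later matches
            }
        } else if (j != 0) {
            j = next[j - 1];          // slide the pattern, keep i
        } else {
            ++i;                      // nothing matched, advance the text
        }
    }
    return hits;
}
```

With this version, kmpFindAll("ABABDABACDABABCABAB", "ABABCABAB") returns a single position, 10, and kmpFindAll("aaaa", "aa") returns 0, 1, 2.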

As an example, take the pattern pat = "abababca" and the input text text = "bacbababaabcbab".

From the above, the entries of the next table are:


char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

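The matching proceeds as follows.

The first character to match is a. The pattern's next character b then fails against the text character c. At this point j = 1 characters have been matched and next[j-1] = 0, so the pattern slides right by j - next[j-1] = 1 - 0 = 1 position; the text position i stays the same and the pattern position becomes j = next[j-1] = 0;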


bacbababaabcbab
 |
 abababca

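The second run of matches covers the characters ababa. The pattern's next character b then fails against the text character a. Here j = 5 characters have been matched and next[j-1] = 3, so the pattern slides right by j - next[j-1] = 5 - 3 = 2 positions, skipping two text characters; the text position i stays the same and the pattern position becomes j = next[j-1] = 3;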


bacbababaabcbab
    |||||
    abababca

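After that slide the alignment is as shown below. The pattern's next character b fails against the text character a. Here j = 3 characters have been matched and next[j-1] = 1, so the pattern slides right by j - next[j-1] = 3 - 1 = 2 positions, again skipping two text characters; the text position i stays the same and the pattern position becomes j = next[j-1] = 1;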


// x denotes a skip
bacbababaabcbab
    xx|||
      abababca

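After the previous slide the alignment is as shown below. The pattern's next character b fails against the text character a. Here j = 1 and next[j-1] = 0, so the pattern slides right by j - next[j-1] = 1 - 0 = 1 position. The pattern would then start at text index 9, where only 6 characters remain, fewer than the pattern's 8, so the match fails: the text contains no occurrence of the pattern;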


// x denotes a skip
bacbababaabcbab
      xx|
        abababca
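The four slides in this walkthrough (1, 2, 2, 1) can be reproduced in code. The sketch below logs every mismatch that occurs after at least one character has matched, together with the slide distance j - next[j-1]. Mismatch, buildNext and traceShifts are names made up for this illustration. Note that this plain loop has no early exit, so it keeps scanning to the end of the text and logs one more mismatch (at text index 11) beyond the four steps described above.

```cpp
#include <string>
#include <vector>

// Hypothetical record: one entry per mismatch that happens with j > 0.
struct Mismatch {
    int textPos;  // text index i where the mismatch occurred
    int matched;  // number of pattern characters matched so far (j)
    int shift;    // slide distance j - next[j-1]
};

// Unoptimized next array: longest proper prefix-suffix length of pat[0..i].
std::vector<int> buildNext(const std::string &pat) {
    std::vector<int> next(pat.size(), 0);
    int len = 0;
    std::size_t i = 1;
    while (i < pat.size()) {
        if (pat[i] == pat[len]) {
            ++len;
            next[i] = len;
            ++i;
        } else if (len != 0) {
            len = next[len - 1];
        } else {
            next[i] = 0;
            ++i;
        }
    }
    return next;
}

// Runs the KMP matching loop and returns the log of slides.
std::vector<Mismatch> traceShifts(const std::string &txt,
                                  const std::string &pat) {
    std::vector<Mismatch> log;
    if (pat.empty())
        return log;
    const std::vector<int> next = buildNext(pat);
    const int n = static_cast<int>(txt.size());
    const int m = static_cast<int>(pat.size());
    int i = 0;
    int j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            ++i;
            ++j;
            if (j == m)
                j = next[j - 1];  // full match: continue scanning
        } else if (j != 0) {
            log.push_back({i, j, j - next[j - 1]});
            j = next[j - 1];
        } else {
            ++i;
        }
    }
    return log;
}
```

For the text "bacbababaabcbab" and the pattern "abababca", the first four log entries have matched counts 1, 5, 3, 1 and slide distances 1, 2, 2, 1, in line with the steps above.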

Complete program:


#include <iostream>
#include <string>
#include <stdlib.h>
using namespace std;

void computeNextArray(const string &pat, int M, int *next);

void KMPSearch(const string &pat, const string &txt)
{
    int M = pat.length();
    int N = txt.length();
    if (M == 0 || M > N)
        return; // nothing to search for, or the pattern cannot fit
    // create next[] that will hold the longest prefix suffix values for pattern
    int *next = (int *)malloc(sizeof(int) * M);
    if (next == NULL)
        return;
    int j = 0; // index for pat[]
    // Preprocess the pattern (calculate next[] array)
    computeNextArray(pat, M, next);
    int i = 0; // index for txt[]
    while (i < N)
    {
        if (pat[j] == txt[i])
        {
            j++;
            i++;
        }
        if (j == M)
        {
            cout << "Found pattern at index:" << i - j << endl;
            j = next[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i])
        {
            // Do not re-compare pat[0..next[j-1]-1],
            // they will match anyway
            if (j != 0)
                j = next[j - 1];
            else
                i = i + 1;
        }
    }
    free(next); // to avoid memory leak
}

void computeNextArray(const string &pat, int M, int *next)
{
    if (M <= 0)
        return;
    int len = 0; // length of the previous longest prefix suffix
    int i = 1;
    next[0] = 0; // next[0] is always 0
    // the loop calculates next[i] for i = 1 to M-1
    while (i < M)
    {
        if (pat[i] == pat[len])
        {
            len++;
            next[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            if (len != 0)
            {
                // This is tricky. Consider the example AAACAAAA and i = 7.
                len = next[len - 1];
                // Also, note that we do not increment i here
            }
            else // if (len == 0)
            {
                next[i] = 0;
                i++;
            }
        }
    }
}

int main()
{
    string txt = "ABABDABACDABABCABAB";
    string pat = "ABABCABAB";
    KMPSearch(pat, txt);
    system("pause"); // Windows-only: keeps the console window open
    return 0;
}

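Optimizing the next array

The array of longest prefix-suffix lengths can be optimized further. Take the pattern pat = abab as an example. In the table below, index is the array subscript; shift is the value row moved one position to the right, with its first entry initialized to -1; and next is the result of optimizing shift further. Note that shift[i] is the length of the longest proper prefix of pat[0..i-1] that is also a suffix of pat[0..i-1], so it does not include the character at position i. In this section next denotes the optimized array.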


char:  | a  | b | a  | b |
index: | 0  | 1 | 2  | 3 |
value: | 0  | 0 | 1  | 2 |
shift: | -1 | 0 | 0  | 1 |
next:  | -1 | 0 | -1 | 0 |

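The example below walks through the optimization. Let the input text and pattern be txt = "abacababc" and pat = "abab".

The first alignment is shown below. Suppose matching uses the unoptimized array, shift. The pattern character b fails against the text character c at pattern position j = 3, so the pattern slides right by j - shift[j] = 3 - 1 = 2 positions.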


abacababc
|||
abab

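After that slide, the pattern character b again fails against the text character c, and it is exactly the same pair of characters as in the previous step. The previous step already established that pat[3] = b does not match txt[3] = c; sliding by two positions puts pat[shift[3]] = pat[1] = b against txt[3], which is bound to fail.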


// x denotes a skip
abacababc
xx|
  abab

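The problem arises when pat[shift[j]] = pat[j]. When pat[j] != txt[i], the next comparison is always between pat[shift[j]] and txt[i]; if pat[shift[j]] = pat[j], that comparison is certain to fail as well. So pat[shift[j]] = pat[j] must not be allowed. Whenever it occurs, recurse once more and set shift[j] = shift[shift[j]]. The array shift after this optimization is the array next.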
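The rule can be turned into code in three explicit steps that mirror the rows of the table for abab: value, then shift, then next. This is a minimal sketch, not the original author's code; the function name optimizeNext is an assumption. Because shift[i] < i, the entry next[shift[i]] is already final when position i is processed, so a single lookup has the same effect as repeating shift[j] = shift[shift[j]] until the characters differ.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds the optimized next array of pat.
std::vector<int> optimizeNext(const std::string &pat) {
    const int m = static_cast<int>(pat.size());
    if (m == 0)
        return {};

    // Step 1: value[i] = longest proper prefix-suffix length of pat[0..i].
    std::vector<int> value(m, 0);
    int len = 0;
    int i = 1;
    while (i < m) {
        if (pat[i] == pat[len]) {
            ++len;
            value[i] = len;
            ++i;
        } else if (len != 0) {
            len = value[len - 1];
        } else {
            value[i] = 0;
            ++i;
        }
    }

    // Step 2: shift is value moved right by one, with shift[0] = -1.
    std::vector<int> shift(m, -1);
    for (int k = 1; k < m; ++k)
        shift[k] = value[k - 1];

    // Step 3: where pat[k] == pat[shift[k]], falling back to shift[k]
    // would repeat a comparison known to fail, so reuse the optimized
    // entry of shift[k] instead.
    std::vector<int> next(shift);
    for (int k = 1; k < m; ++k) {
        if (pat[k] == pat[shift[k]])
            next[k] = next[shift[k]];
    }
    return next;
}
```

For "abab" this returns -1, 0, -1, 0.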

Now look again at the optimized next array of the pattern pat = abab; the table below shows how it is derived:


+------------+----------------+---------------------+--------------------+--------------------+
| char:      | a              | b                   | a                  | b                  |
+------------+----------------+---------------------+--------------------+--------------------+
| index:     | 0              | 1                   | 2                  | 3                  |
+------------+----------------+---------------------+--------------------+--------------------+
| value:     | 0              | 0                   | 1                  | 2                  |
+------------+----------------+---------------------+--------------------+--------------------+
| shift:     | -1             | 0                   | 0                  | 1                  |
+------------+----------------+---------------------+--------------------+--------------------+
| reason:    | initial value, | p[1] != p[shift[1]] | p[2] = p[shift[2]] | p[3] = p[shift[3]] |
|            | unchanged      |                     |                    |                    |
+------------+----------------+---------------------+--------------------+--------------------+
| operation: | do nothing     | do nothing          | shift[2] =         | shift[3] =         |
|            |                |                     | shift[shift[2]]    | shift[shift[3]]    |
+------------+----------------+---------------------+--------------------+--------------------+
| next:      | -1             | 0                   | -1                 | 0                  |
+------------+----------------+---------------------+--------------------+--------------------+
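With the optimized array, the matching loop changes slightly, because entries can now be -1. A commonly seen formulation is sketched below: on a mismatch j falls back to next[j], and j = -1 means "advance the text and restart the pattern". The names buildOptimizedNext and findFirst are assumptions made for this example, and the table is built here in a single pass rather than row by row as in the table above.

```cpp
#include <string>
#include <vector>

// One-pass construction of the optimized next array (next[0] = -1).
// k tracks the longest proper prefix-suffix length of pat[0..i-1].
std::vector<int> buildOptimizedNext(const std::string &pat) {
    const int m = static_cast<int>(pat.size());
    std::vector<int> next(m, -1);
    int i = 0;
    int k = -1;
    while (i < m - 1) {
        if (k == -1 || pat[i] == pat[k]) {
            ++i;
            ++k;
            // If pat[i] == pat[k], falling back to k would repeat a
            // comparison known to fail, so reuse next[k] instead.
            next[i] = (pat[i] == pat[k]) ? next[k] : k;
        } else {
            k = next[k];
        }
    }
    return next;
}

// Hypothetical search: index of the first occurrence of pat in txt,
// or -1 if there is none. An empty pattern matches at index 0.
int findFirst(const std::string &txt, const std::string &pat) {
    const int n = static_cast<int>(txt.size());
    const int m = static_cast<int>(pat.size());
    if (m == 0)
        return 0;
    const std::vector<int> next = buildOptimizedNext(pat);
    int i = 0;  // index into txt
    int j = 0;  // index into pat, may become -1
    while (i < n && j < m) {
        if (j == -1 || txt[i] == pat[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    return (j == m) ? i - m : -1;
}
```

On txt = "abacababc" and pat = "abab", the mismatch at j = 3 falls back to next[3] = 0 instead of shift[3] = 1, which skips the comparison of pat[1] = b with c that was shown to be wasted, and the search then finds the pattern at index 4.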