Knuth-Morris-Pratt 算法中的 DFA 构造

Question

我指的是 Sedgewick 的书 "Algorithms"（第 4 版）中用于子字符串搜索的 Knuth-Morris-Pratt (KMP) 算法的概述。

KMP 算法在基于确定性有限自动机 (DFA) 的子串搜索中使用备份。我了解DFA如何进入算法，但是我不明白如何构造 DFA，这是通过以下代码片段完成的：

dfa[pat.charAt(0)][0] = 1;
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) {
       dfa[c][j] = dfa[c][X];
    }
    dfa[pat.charAt(j)][j] = j+1;
    X = dfa[pat.charAt(j)][X];
}

其中 M 是模式的长度 pat，R 是字母表的大小。 charAt()函数returns相应位置字符的整数值。

有人能解释一下这段代码是用什么方式构造dfa的吗？我迷失在内部 for 循环的实际直观含义中。

Answer 1

让我们看一下模式 ACACAGA 的以下 FA。

以上图表代表 ACACAGA 模式的图形和表格表示。

这里，DFA 中的状态数是 M + 1，其中 M 是模式的长度。构造 DFA 的主要事情是为每个可能的字符从当前状态获取下一个状态。给定一个字符 x 和一个状态 k，我们可以通过考虑字符串“pat[0..k-1]x”来获得下一个状态，它基本上是模式字符 pat[0]、pat1 的串联…… pat[k-1] 和字符 x。这个想法是获取给定模式的最长前缀的长度，使得该前缀也是“pat[0..k-1]x”的后缀（LPS）。 length 的值给了我们下一个状态。

例如，让我们看看如何从上图中的当前状态5和字符'C'获取下一个状态。我们需要考虑字符串“pat[0..5]C”，即“ACACAC”。前缀为“ACACAC”后缀的模式的最长前缀的长度为 4（“ACAC”）。所以下一个状态（从状态 5 开始）是字符“C”的 4。

// dfa[i][j] = k denotes the transition function will go k'th state 
// with character i from state j

// DFA will go always (i + 1)'th state from i'th state 
//if the character c is equal to current character of pattern 
dfa[pat.charAt(0)][0] = 1;

//  here X is initialized with LPS (longest prefix suffix) 
// of pattern[0....j - 1]. LPS[0] is always 0
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) { // for all possible characters
        // transition function from j'th state taking character c 
        // will go to the same state where it went from X(LPS)'th state
        // taking character c (justify it with an example) 
        dfa[c][j] = dfa[c][X];
    }
    // DFA will go always (i + 1)th state from i'th state if 
    // the character c is equal to current character of pattern
    dfa[pat.charAt(j)][j] = j + 1;
    X = dfa[pat.charAt(j)][X]; // update LPS
}

Knuth-Morris-Pratt 算法中的 DFA 构造

DFA construction in Knuth-Morris-Pratt algorithm

string

algorithm

substring

dfa

knuth-morris-pratt