在特里树中实现模式匹配

Question

我目前正在使用这个堆栈溢出的 trie 实现 post:

Getting a list of words from a Trie

到return匹配给定前缀的单词列表。然后我使用正则表达式过滤掉不符合整个指定模式的单词。

EX：如果我要搜索的模式是：CH??S?这是与我的初始前缀匹配的字典的子集： {CHABAD、CHACHA、CHARIOT、CHATTED、CHEATER、CHOMSKY、CHANNEL CHAFED、CHAFER、CHAIRS、CHAIRS、CHEESE、CHEESY CHRONO、CHUTES、CHISEL}

我会搜索带有 'CH' 前缀的 trie，然后过滤出与我想要的 CH??S? 模式匹配的单词。（干酪、干酪、凿子）和 return 那些。

我想知道是否有更快的方法来避免在最后一步使用正则表达式。我想我可以使用后缀树 (Ukkonen's suffix tree algorithm in plain English ) 或 boyer-moore 算法，但都不起作用，因为它们搜索后缀而不是模式。

Answer 1

这里有一个很好的递归算法，您可以使用它来消除使用最终正则表达式传递的需要。它通过将模式 P 与树 T 进行匹配来工作：

FindMatches(pattern P, tree T) {
    if (tree is empty) {
        return that there are no matches;
    }

    if (pattern is empty) {
        report a match if T is a word;
    } else if (pattern[0] is a letter) {
        FindMatches(P[1:], T.childFor(pattern[0]));
    } else if (pattern[0] is ?) {
        for (each child C of T) {
             gather matches from FindMatches(P[1:], C);
        }
        report all matches found this way;
    } else {
        something is wrong with your pattern;
    }
}

您仅匹配 CH 的方法遵循此方法。但是，当您到达匹配 ? 个字符的位置时，您的方法只是执行 DFS 以查找当前点以下的所有单词，而这种方法只是在一个级别上分支并收集匹配项。换句话说，这种方法是 "follow one character at a time if there's a clear character to follow" 和 "explore the whole subtree looking for matches," 的混合体，在战略上侧重于分支。

在特里树中实现模式匹配

Implementing pattern matching in a trie tree

java

string

pattern-matching

trie

data-structures