面对插入和删除更新 Aho-Corasick trie
Updating an Aho-Corasick trie in the face of inserts and deletes
我发现的关于 Aho-Corasick 的所有文献和实现都是关于从一组短语预先构建整个 trie 的。但是,我对将其作为可变数据结构使用的方法很感兴趣,它可以处理偶尔的添加和删除,而无需重建整个 trie(假设其中有 100 万个条目)。最坏的情况很糟糕也没关系,只要平均情况接近对数。
据我所知,每个节点的失败状态是另一个使用相同符号的节点。因此,如果我们有一个从每个符号到使用该符号的节点列表的哈希多重映射,我们就有了失败状态需要更新的候选人。
删除很简单。您找到使用已删除节点作为故障状态的所有其他节点并重新计算它们的故障状态。从字符串的末尾向后移动,树 应该 仍然处于良好状态。
添加有点棘手。任何未能达到该符号的节点都可以将新节点作为更好的候选节点。但是,再次 似乎 足以遍历具有该符号的所有其他节点并完全重新计算其失败状态。
换句话说,如果我们要添加或删除带有符号 "A" 的节点,我们需要访问 trie 中任何其他 "A" 节点,并重新计算失败状态(看对于其最近的祖先 "A" 作为 child 或根)。这将需要访问符号 "A" 的每个节点,但在我的例子中,这将比访问 trie 中的每个节点少几个数量级。
这个算法行得通吗,还是我遗漏了一些明显的东西?
我继续实施它,似乎有效。对算法的更好描述,包括图片:https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie
为了后代(以及根据 Whosebug 政策),我已将其复制到此处:
It supports modification (add/remove) instead of being built all at
once from a pre-set dictionary. This isn't mentioned in any literature
about it, and all the open-source implementations I've seen of it have
followed in that respect. It turns out that supporting add and remove
operations to an AC automaton is fairly trivial. For deep tries over a
small alphabet, it would not be very time-efficient, but luckily we
have a shallow trie over a large alphabet.
We store a hash multimap of each token to every node that uses that
token. When we remove a phrase, we start from the last (bottom-most)
node. We remove the pointer to the phrase from this bottom-most node.
Then, if it has any children, we can't delete any more. Otherwise, go
through each other node that uses this node as a fail state and
recompute its fail state. Finally, delete the node, then go up to the
node's parent and do the same process. We'll eventually reach either a
node that has another phrase output, a child, or the root node.
That's kind of difficult to picture, so consider this trie (stolen
from the Wikipedia article). We're going to remove the string caa
(light gray color), which will require that the fail states in yellow
be updated:
The result:
Note there are cases where a node's fail state might be updated 2 or
more times in the same remove operation, but will eventually be
correct. This could be avoided by removing everything first, then
going back through tokens, but the code is complicated enough as it
is, and that would only provide a performance benefit in certain
awkward circumstances, while being a performance hit in the average
case. We just assume that strings containing the same token more than
once are rare.
Adding a node is slightly more difficult, but can make use of the same
hash multimap structure. Each time a node is added, every other node
in the trie that uses the same token and is at a lower depth than the
added node could need to have its fail state updated to the new node.
To help picture that, imagine going from the second diagram back to
the first diagram. The same two fail states are the ones who would
need to be updated if adding the string caa back into that trie, and
since we recompute every c and a node, it's not a problem.
We could keep track of the node depths to skip over about half, but
right now it's just recomputing the fail state for every node that
shares the token to save memory. This means that tries where the
tokens appear many times will be slow to add to, but again that's
considered a rare enough situation. It would be a concern if you were
trying to adapt this technique to something like a DNA sequence or
antivirus database where there is a small alphabet.
The time complexity depends on how many nodes share symbols and how
deep the trie is, meaning it's worst-case O(N), but in practice a
fairly small number (average-case is roughly O(log^2 N) ).
非常混乱且大部分未注释的代码,但如果您好奇,这里是 the header, the body, and some tests
我发现的关于 Aho-Corasick 的所有文献和实现都是关于从一组短语预先构建整个 trie 的。但是,我对将其作为可变数据结构使用的方法很感兴趣,它可以处理偶尔的添加和删除,而无需重建整个 trie(假设其中有 100 万个条目)。最坏的情况很糟糕也没关系,只要平均情况接近对数。
据我所知,每个节点的失败状态是另一个使用相同符号的节点。因此,如果我们有一个从每个符号到使用该符号的节点列表的哈希多重映射,我们就有了失败状态需要更新的候选人。
删除很简单。您找到使用已删除节点作为故障状态的所有其他节点并重新计算它们的故障状态。从字符串的末尾向后移动,树 应该 仍然处于良好状态。
添加有点棘手。任何未能达到该符号的节点都可以将新节点作为更好的候选节点。但是,再次 似乎 足以遍历具有该符号的所有其他节点并完全重新计算其失败状态。
换句话说,如果我们要添加或删除带有符号 "A" 的节点,我们需要访问 trie 中任何其他 "A" 节点,并重新计算失败状态(看对于其最近的祖先 "A" 作为 child 或根)。这将需要访问符号 "A" 的每个节点,但在我的例子中,这将比访问 trie 中的每个节点少几个数量级。
这个算法行得通吗,还是我遗漏了一些明显的东西?
我继续实施它,似乎有效。对算法的更好描述,包括图片:https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie
为了后代(以及根据 Whosebug 政策),我已将其复制到此处:
It supports modification (add/remove) instead of being built all at once from a pre-set dictionary. This isn't mentioned in any literature about it, and all the open-source implementations I've seen of it have followed in that respect. It turns out that supporting add and remove operations to an AC automaton is fairly trivial. For deep tries over a small alphabet, it would not be very time-efficient, but luckily we have a shallow trie over a large alphabet.
We store a hash multimap of each token to every node that uses that token. When we remove a phrase, we start from the last (bottom-most) node. We remove the pointer to the phrase from this bottom-most node. Then, if it has any children, we can't delete any more. Otherwise, go through each other node that uses this node as a fail state and recompute its fail state. Finally, delete the node, then go up to the node's parent and do the same process. We'll eventually reach either a node that has another phrase output, a child, or the root node.
That's kind of difficult to picture, so consider this trie (stolen from the Wikipedia article). We're going to remove the string caa (light gray color), which will require that the fail states in yellow be updated:
The result:
Note there are cases where a node's fail state might be updated 2 or more times in the same remove operation, but will eventually be correct. This could be avoided by removing everything first, then going back through tokens, but the code is complicated enough as it is, and that would only provide a performance benefit in certain awkward circumstances, while being a performance hit in the average case. We just assume that strings containing the same token more than once are rare.
Adding a node is slightly more difficult, but can make use of the same hash multimap structure. Each time a node is added, every other node in the trie that uses the same token and is at a lower depth than the added node could need to have its fail state updated to the new node.
To help picture that, imagine going from the second diagram back to the first diagram. The same two fail states are the ones who would need to be updated if adding the string caa back into that trie, and since we recompute every c and a node, it's not a problem.
We could keep track of the node depths to skip over about half, but right now it's just recomputing the fail state for every node that shares the token to save memory. This means that tries where the tokens appear many times will be slow to add to, but again that's considered a rare enough situation. It would be a concern if you were trying to adapt this technique to something like a DNA sequence or antivirus database where there is a small alphabet.
The time complexity depends on how many nodes share symbols and how deep the trie is, meaning it's worst-case O(N), but in practice a fairly small number (average-case is roughly O(log^2 N) ).
非常混乱且大部分未注释的代码,但如果您好奇,这里是 the header, the body, and some tests