Why do we need to check for a pattern match every time the hash value is the same in the Rabin-Karp algorithm?

Question

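I don't understand why we need to check for a substring match every time the hash function returns the same value for the pattern and the text. Isn't the returned hash value unique to a string?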

Answer 1

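The hash function used in the Rabin-Karp algorithm is a "rolling hash" such as the Rabin fingerprint. It is chosen for the property that a hash can be computed easily from the previous hash, not for its collision resistance.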

In the Rabin-Karp algorithm we need to compute hashes of a sliding substring. Say we are searching for a 24-character string in this text:

"this is the text we are comparing"

We would need to compute the hash of each of these substrings:

"this is the text we are "
"his is the text we are c"
"is is the text we are co"
"s is the text we are com"
" is the text we are comp"
"is the text we are compa"
"s the text we are compar"
" the text we are compari"
"the text we are comparin"
"he text we are comparing"

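So we choose a "rolling hash" function: after computing the hash of the first substring, we can compute the hash of the second substring using only the first hash, the character that was dropped from the substring, and the character that was added to it: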

"this is the text we are "  ->  hash1
"his is the text we are c"  ->  hash1 -t +c  ->  hash2

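The "-t +c" update above is shorthand. As a minimal sketch, assuming a polynomial rolling hash (one common choice, not necessarily the exact Rabin fingerprint), the update could look like the following. The names `BASE`, `MOD`, `direct_hash` and `rolling_hashes` are illustrative assumptions, not part of the original answer.

```python
BASE = 256         # assumed alphabet size (one byte per character)
MOD = 1_000_003    # assumed prime modulus; any reasonably large prime works


def direct_hash(s):
    """Hash a string from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, m):
    """Yield (start_index, hash) for every window of length m in text."""
    if m <= 0 or m > len(text):
        return
    high = pow(BASE, m - 1, MOD)  # weight of the window's leading character
    h = direct_hash(text[:m])
    yield 0, h
    for i in range(1, len(text) - m + 1):
        # Drop the outgoing character, then shift and append the incoming one.
        h = (h - ord(text[i - 1]) * high) % MOD
        h = (h * BASE + ord(text[i + m - 1])) % MOD
        yield i, h


if __name__ == "__main__":
    text = "this is the text we are comparing"
    for start, h in rolling_hashes(text, 24):
        print(repr(text[start:start + 24]), h)
```

Each step after the first costs a constant amount of work regardless of the window length, which is the property the answer refers to.
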
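Such a "rolling hash" function is not necessarily one for which the probability of finding two strings with the same hash is very small, as it is for a cryptographic hash function. So the fact that the hashes are equal does not guarantee that the substring is identical to the search string; we need to do a full string comparison to be sure.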
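
To illustrate why the comparison is needed, here is a small sketch that uses a deliberately weak additive hash (the sum of character codes, which is the literal reading of "hash1 -t +c"). The function names `additive_hash` and `find_all` are hypothetical and chosen only for this example; a real implementation would use a stronger rolling hash, but the verification step is the same.

```python
def additive_hash(s):
    """Deliberately weak hash: the sum of the character codes."""
    return sum(ord(ch) for ch in s)


def find_all(text, pattern):
    """Return (match_positions, spurious_hit_count) for pattern in text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return [], 0
    target = additive_hash(pattern)
    h = additive_hash(text[:m])
    matches, spurious = [], 0
    for i in range(len(text) - m + 1):
        if i > 0:
            # Rolling update: subtract the outgoing char, add the incoming one.
            h = h - ord(text[i - 1]) + ord(text[i + m - 1])
        if h == target:
            # Equal hashes only mean "possible match"; confirm by comparing.
            if text[i:i + m] == pattern:
                matches.append(i)
            else:
                spurious += 1
    return matches, spurious


if __name__ == "__main__":
    print(find_all("abba cab", "ab"))
```

In this input the window "ba" has the same hash as the pattern "ab" without being equal to it, so skipping the comparison would report a false match at that position.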

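Note that any hash function that produces a hash shorter than its input will necessarily have collisions. Using a hash that is much shorter than the input string is the whole point of the Rabin-Karp algorithm: comparing hashes is more efficient than comparing long strings.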
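
The collision claim follows from the pigeonhole principle: if there are more possible inputs than possible hash values, at least two inputs must share a value. A rough sketch that finds such a pair by brute force, assuming a polynomial hash with a small modulus (the names `poly_hash` and `first_collision` and the modulus 101 are assumptions made for this illustration):

```python
from itertools import product


def poly_hash(s, mod, base=256):
    """Polynomial hash of s reduced modulo mod."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def first_collision(mod, length=2, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return two distinct strings with the same hash, or None if none exist."""
    seen = {}
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        h = poly_hash(s, mod)
        if h in seen:
            return seen[h], s
        seen[h] = s
    return None


if __name__ == "__main__":
    # 26 * 26 = 676 two-letter strings but only 101 possible hash values,
    # so a collision is guaranteed to be found.
    print(first_collision(101))
```

A larger modulus makes collisions rarer but cannot eliminate them as long as the hash has fewer possible values than there are input strings.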

algorithm

hash

rabin-karp