Can you really divide identical strings into groups efficiently through hashes?

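I was reading this article about string hashing. In the section "Search for duplicate strings in an array of strings", it claims that you can group identical strings with a time complexity of O(nm + n log n) by using string hashing.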

Looking at the code example given in the article:

vector<vector<int>> group_identical_strings(vector<string> const& s) {
    int n = s.size();
    vector<pair<long long, int>> hashes(n);
    for (int i = 0; i < n; i++)
        hashes[i] = {compute_hash(s[i]), i};

    sort(hashes.begin(), hashes.end());

    vector<vector<int>> groups;
    for (int i = 0; i < n; i++) {
        if (i == 0 || hashes[i].first != hashes[i-1].first)
            groups.emplace_back();
        groups.back().push_back(hashes[i].second);
    }
    return groups;
}

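I am very confused about the correctness of this code, because it only creates a new group when hashes[i].first != hashes[i-1].first. Two strings can be different yet have the same hash, so couldn't two different strings end up in the same group? This condition does not seem sufficient to me.

Am I wrong? Is this code correct? Why?

If not, is this algorithm, or at least this complexity, actually achievable?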

You are right that two different strings can have the same hash value. This is called a hash collision. However, it boils down to which hash function you use. There are hash functions for which finding a collision is so unlikely that you can safely use this algorithm without fear of it breaking. In cryptography, we rely on this property of cryptographically secure hash functions (see e.g. here).

In fact, the source you mention states the following:

That's the important part that you have to keep in mind. Using hashing will not be 100% deterministically correct, because two complete different strings might have the same hash (the hashes collide). However, in a wide majority of tasks, this can be safely ignored as the probability of the hashes of two different strings colliding is still very small. And we will discuss some techniques in this article how to keep the probability of collisions very low.
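To put a rough number on "very small", here is a back-of-the-envelope sketch. It assumes the hash behaves like a uniformly random value over m possibilities, which is an idealization and not a guarantee of any particular hash function. The name collision_probability is hypothetical, and the formula is the standard birthday approximation 1 - exp(-n(n-1)/(2m)).

```cpp
#include <cmath>

// Approximate probability that at least two of n distinct strings share a hash,
// ASSUMING hash values are uniform and independent over m possible values.
// Uses the birthday approximation 1 - exp(-n(n-1)/(2m)).
double collision_probability(double n, double m) {
    // expm1(x) = exp(x) - 1, which keeps precision when the result is tiny.
    return -std::expm1(-n * (n - 1.0) / (2.0 * m));
}
```

Under that assumption, two strings hashed modulo roughly 10^9 collide with probability of about 10^-9. With a million strings all compared against each other, the same estimate is close to 1, which is why a larger hash range (for example 64 bits, or two independent hashes) matters when n is large.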

So, as you say, the algorithm is not mathematically correct. But with the right hash function, the probability that it breaks in practice is negligible.
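If you need a guaranteed-correct result, one option is to keep the hash sort but compare the actual strings inside each run of equal hashes. The sketch below is my own variant, not the article's code. The compute_hash shown here is an illustrative polynomial hash with assumed constants (base 31, modulus 10^9 + 9), and group_identical_strings_checked is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative polynomial hash; the base and modulus are assumptions.
long long compute_hash(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    long long h = 0, pw = 1;
    for (unsigned char c : s) {
        h = (h + (c + 1) * pw) % m;
        pw = (pw * p) % m;
    }
    return h;
}

// Groups indices of identical strings. Hashes are used only to bring
// candidates together; equality is confirmed by comparing the strings.
std::vector<std::vector<int>> group_identical_strings_checked(
        const std::vector<std::string>& s) {
    int n = s.size();
    std::vector<std::pair<long long, int>> hashes(n);
    for (int i = 0; i < n; i++)
        hashes[i] = {compute_hash(s[i]), i};

    std::sort(hashes.begin(), hashes.end());

    std::vector<std::vector<int>> groups;
    int i = 0;
    while (i < n) {
        // Find the run [i, j) of entries that share one hash value.
        int j = i;
        while (j < n && hashes[j].first == hashes[i].first) j++;

        // Groups created for this run start at this position.
        std::size_t first_group = groups.size();
        for (int k = i; k < j; k++) {
            int idx = hashes[k].second;
            bool placed = false;
            // Compare against one representative of each group in the run.
            for (std::size_t g = first_group; g < groups.size(); g++) {
                if (s[groups[g].front()] == s[idx]) {
                    groups[g].push_back(idx);
                    placed = true;
                    break;
                }
            }
            if (!placed) groups.push_back({idx});
        }
        i = j;
    }
    return groups;
}
```

With this toy hash, "aa" and "Bb" both hash to 3136, so the original code would put them in one group, while this version separates them. As long as collisions are rare, each string is compared against only one representative, so the expected running time stays at O(nm + n log n); the worst case with many collisions is slower.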