这个简单的哈希函数背后的逻辑是什么?
What is the logic behind this simple hash function?
谁能告诉我这个简单的哈希函数背后的数学逻辑是什么。
#define HASHSIZE 101
unsigned hash_function(char *s) {
unsigned hashval;
for (hashval = 0; *s != '[=10=]'; s++)
// THIS NEXT LINE HOW DOES IT WORK WHY 31
hashval = *s + 31 * hashval;
return hashval % HASHSIZE;
}
在这里,我不是问指针和编程。我只是问下面的语句是如何工作的。
hashval = *s + 31 * hashval
假设您有一个值 x
带位...
b7 b6 b5 b4 b3 b2 b1 b0
当你乘以 31 时,你实际上是在将那些左移一位的位相加(这与乘以二具有相同的效果,就像在十进制乘以十时添加一个试验零一样),乘以二(即4x
), 三个 (8x
), 四个 (16x
):
b7 b6 b5 b4 b3 b2 b1 b0 + // 1x
b7 b6 b5 b4 b3 b2 b1 b0 0 + // + 2x = 3x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 + // + 4x = 7x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 0 + // + 8x = 15x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 0 0 + // + 16x = 31x
许多单独的输出位受到许多输入值的影响,但是 直接和来自 "carries"来自 less-significant 列;例如如果 b1 = 1 和 b0 = 1,第二个最低有效输出位(左侧的“10"s column) will be 0 but carry over 1 to the "100”列。
输出位受许多输入位影响的属性有助于"mix"输出哈希值,提高质量。
尽管如此,虽然它可能比乘以 17 (16+1) 或 12 (8+4) 更好,因为它们只添加了几个 bit-shifted 原始值的副本而不是五个,与执行不止一次 运行ning 乘法运算的散列函数相比,这是一个 非常 弱的散列,我将使用一些统计分析来说明...
作为散列质量的样本,我散列了 四个 个可打印 ASCII 字符的所有组合,并查看了相同散列的次数结果值。我选择了四个,因为它在合理的时间范围内(几十秒)是最可行的。可用的代码 here(不确定它是否会在那里超时 - 在本地只有 运行)并且在这个 post.
的底部
通过下面的一两行输出来解释 per-line 格式可能会有所帮助:
- 只有 62 个输入(输入的 0.000076%)散列为唯一值
- 另外还有 62 次 2 个输入散列为相同的值(即 124 个输入被这个 2-way 冲突组所占,即 0.000152%)
- 这个和lower-collision组累计占输入的0.000228%
完整输出:
81450625 4-character printable ASCII combinations
#collisions 1 with #times 62 0.000076% 0.000076%
#collisions 2 with #times 62 0.000152% 0.000228%
#collisions 3 with #times 1686 0.006210% 0.006438%
#collisions 4 with #times 170 0.000835% 0.007273%
#collisions 5 with #times 62 0.000381% 0.007654%
#collisions 6 with #times 1686 0.012420% 0.020074%
#collisions 7 with #times 62 0.000533% 0.020606%
#collisions 8 with #times 170 0.001670% 0.022276%
#collisions 9 with #times 45534 0.503134% 0.525410%
#collisions 10 with #times 3252 0.039926% 0.565336%
#collisions 11 with #times 3310 0.044702% 0.610038%
#collisions 12 with #times 4590 0.067624% 0.677662%
#collisions 13 with #times 340 0.005427% 0.683089%
#collisions 14 with #times 456 0.007838% 0.690927%
#collisions 15 with #times 1566 0.028840% 0.719766%
#collisions 16 with #times 224 0.004400% 0.724166%
#collisions 17 with #times 124 0.002588% 0.726754%
#collisions 18 with #times 45422 1.003793% 1.730548%
#collisions 19 with #times 116 0.002706% 1.733254%
#collisions 20 with #times 3414 0.083830% 1.817084%
#collisions 21 with #times 1632 0.042077% 1.859161%
#collisions 22 with #times 3256 0.087945% 1.947106%
#collisions 23 with #times 58 0.001638% 1.948744%
#collisions 24 with #times 4702 0.138548% 2.087292%
#collisions 25 with #times 66 0.002026% 2.089317%
#collisions 26 with #times 286 0.009129% 2.098447%
#collisions 27 with #times 1969365 65.282317% 67.380763%
#collisions 28 with #times 498 0.017120% 67.397883%
#collisions 29 with #times 58 0.002065% 67.399948%
#collisions 30 with #times 284614 10.482940% 77.882888%
#collisions 31 with #times 5402 0.205599% 78.088487%
#collisions 32 with #times 108 0.004243% 78.092730%
#collisions 33 with #times 289884 11.744750% 89.837480%
#collisions 34 with #times 5344 0.223075% 90.060555%
#collisions 35 with #times 5344 0.229636% 90.290191%
#collisions 36 with #times 146792 6.487994% 96.778186%
#collisions 38 with #times 5344 0.249319% 97.027505%
#collisions 39 with #times 20364 0.975064% 98.002569%
#collisions 40 with #times 9940 0.488148% 98.490718%
#collisions 42 with #times 14532 0.749342% 99.240060%
#collisions 43 with #times 368 0.019428% 99.259488%
#collisions 44 with #times 10304 0.556627% 99.816114%
#collisions 45 with #times 368 0.020331% 99.836446%
#collisions 46 with #times 368 0.020783% 99.857229%
#collisions 47 with #times 736 0.042470% 99.899699%
#collisions 48 with #times 368 0.021687% 99.921386%
#collisions 49 with #times 368 0.022139% 99.943524%
#collisions 50 with #times 920 0.056476% 100.000000%
总体观察:
- 65.3% 的输入以 27 向碰撞告终; 33路11.7%; 30 向碰撞中为 10.5%; 36 路 6.5%...
- 不到 2.1% 的输入避免了 27 向或更糟的碰撞
尽管输入仅占散列值的 1.9% space(在 2^32 中计算为 81450625,因为我们散列为 32 位值)。太惨了。
为了展示它是多么糟糕,让我们比较一下将 4 个可打印的 ASCII 字符放入 GCC std::string
s 中用于 GCC std::hash<std::string>
,我相信从记忆中使用 MURMUR32 哈希:
81450625 4-character printable ASCII combinations
#collisions 1 with #times 79921222 98.122294% 98.122294%
#collisions 2 with #times 757434 1.859860% 99.982155%
#collisions 3 with #times 4809 0.017713% 99.999867%
#collisions 4 with #times 27 0.000133% 100.000000%
所以 - 回到为什么 + 31 * previous
的问题 - 你必须与其他同样简单的散列函数进行比较,看看这是否优于生成 CPU 的平均水平哈希,尽管它在绝对意义上如此糟糕,但它是模糊可能的,但考虑到 大量 更好的哈希的额外成本非常小,我建议使用一个并忘记 "*31 " 完全。
代码:
#include <map>
#include <iostream>
#include <iomanip>
int main()
{
std::map<unsigned, unsigned> histogram;
for (int i = ' '; i <= '~'; ++i)
for (int j = ' '; j <= '~'; ++j)
for (int k = ' '; k <= '~'; ++k)
for (int l = ' '; l <= '~'; ++l)
{
unsigned hv = ((i * 31 + j) * 31 + k) * 31 + l;
/*
// use "31*" hash above OR std::hash<std::string> below...
char c[] = { i, j, k, l, '[=14=]' };
unsigned hv = std::hash<std::string>()(c); */
++histogram[hv];
}
std::map<unsigned, unsigned> histohisto;
for (auto& hv_freq : histogram)
++histohisto[hv_freq.second];
unsigned n = '~' - ' ' + 1; n *= n; n *= n;
std::cout << n << " 4-character printable ASCII combinations\n";
double cumulative_percentage = 0;
for (auto& freq_n : histohisto)
{
double percent = (double)freq_n.first * freq_n.second / n * 100;
cumulative_percentage += percent;
std::cout << "#collisions " << freq_n.first << " with #times " << freq_n.second << "\t\t" << std::fixed << percent << "% " << cumulative_percentage << "%\n";
}
}
谁能告诉我这个简单的哈希函数背后的数学逻辑是什么。
#define HASHSIZE 101
unsigned hash_function(char *s) {
unsigned hashval;
for (hashval = 0; *s != '[=10=]'; s++)
// THIS NEXT LINE HOW DOES IT WORK WHY 31
hashval = *s + 31 * hashval;
return hashval % HASHSIZE;
}
在这里,我不是问指针和编程。我只是问下面的语句是如何工作的。
hashval = *s + 31 * hashval
假设您有一个值 x
带位...
b7 b6 b5 b4 b3 b2 b1 b0
当你乘以 31 时,你实际上是在将那些左移一位的位相加(这与乘以二具有相同的效果,就像在十进制乘以十时添加一个试验零一样),乘以二(即4x
), 三个 (8x
), 四个 (16x
):
b7 b6 b5 b4 b3 b2 b1 b0 + // 1x
b7 b6 b5 b4 b3 b2 b1 b0 0 + // + 2x = 3x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 + // + 4x = 7x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 0 + // + 8x = 15x
b7 b6 b5 b4 b3 b2 b1 b0 0 0 0 0 + // + 16x = 31x
许多单独的输出位受到许多输入值的影响,但是 直接和来自 "carries"来自 less-significant 列;例如如果 b1 = 1 和 b0 = 1,第二个最低有效输出位(左侧的“10"s column) will be 0 but carry over 1 to the "100”列。
输出位受许多输入位影响的属性有助于"mix"输出哈希值,提高质量。
尽管如此,虽然它可能比乘以 17 (16+1) 或 12 (8+4) 更好,因为它们只添加了几个 bit-shifted 原始值的副本而不是五个,与执行不止一次 运行ning 乘法运算的散列函数相比,这是一个 非常 弱的散列,我将使用一些统计分析来说明...
作为散列质量的样本,我散列了 四个 个可打印 ASCII 字符的所有组合,并查看了相同散列的次数结果值。我选择了四个,因为它在合理的时间范围内(几十秒)是最可行的。可用的代码 here(不确定它是否会在那里超时 - 在本地只有 运行)并且在这个 post.
的底部通过下面的一两行输出来解释 per-line 格式可能会有所帮助:
- 只有 62 个输入(输入的 0.000076%)散列为唯一值
- 另外还有 62 次 2 个输入散列为相同的值(即 124 个输入被这个 2-way 冲突组所占,即 0.000152%)
- 这个和lower-collision组累计占输入的0.000228%
完整输出:
81450625 4-character printable ASCII combinations
#collisions 1 with #times 62 0.000076% 0.000076%
#collisions 2 with #times 62 0.000152% 0.000228%
#collisions 3 with #times 1686 0.006210% 0.006438%
#collisions 4 with #times 170 0.000835% 0.007273%
#collisions 5 with #times 62 0.000381% 0.007654%
#collisions 6 with #times 1686 0.012420% 0.020074%
#collisions 7 with #times 62 0.000533% 0.020606%
#collisions 8 with #times 170 0.001670% 0.022276%
#collisions 9 with #times 45534 0.503134% 0.525410%
#collisions 10 with #times 3252 0.039926% 0.565336%
#collisions 11 with #times 3310 0.044702% 0.610038%
#collisions 12 with #times 4590 0.067624% 0.677662%
#collisions 13 with #times 340 0.005427% 0.683089%
#collisions 14 with #times 456 0.007838% 0.690927%
#collisions 15 with #times 1566 0.028840% 0.719766%
#collisions 16 with #times 224 0.004400% 0.724166%
#collisions 17 with #times 124 0.002588% 0.726754%
#collisions 18 with #times 45422 1.003793% 1.730548%
#collisions 19 with #times 116 0.002706% 1.733254%
#collisions 20 with #times 3414 0.083830% 1.817084%
#collisions 21 with #times 1632 0.042077% 1.859161%
#collisions 22 with #times 3256 0.087945% 1.947106%
#collisions 23 with #times 58 0.001638% 1.948744%
#collisions 24 with #times 4702 0.138548% 2.087292%
#collisions 25 with #times 66 0.002026% 2.089317%
#collisions 26 with #times 286 0.009129% 2.098447%
#collisions 27 with #times 1969365 65.282317% 67.380763%
#collisions 28 with #times 498 0.017120% 67.397883%
#collisions 29 with #times 58 0.002065% 67.399948%
#collisions 30 with #times 284614 10.482940% 77.882888%
#collisions 31 with #times 5402 0.205599% 78.088487%
#collisions 32 with #times 108 0.004243% 78.092730%
#collisions 33 with #times 289884 11.744750% 89.837480%
#collisions 34 with #times 5344 0.223075% 90.060555%
#collisions 35 with #times 5344 0.229636% 90.290191%
#collisions 36 with #times 146792 6.487994% 96.778186%
#collisions 38 with #times 5344 0.249319% 97.027505%
#collisions 39 with #times 20364 0.975064% 98.002569%
#collisions 40 with #times 9940 0.488148% 98.490718%
#collisions 42 with #times 14532 0.749342% 99.240060%
#collisions 43 with #times 368 0.019428% 99.259488%
#collisions 44 with #times 10304 0.556627% 99.816114%
#collisions 45 with #times 368 0.020331% 99.836446%
#collisions 46 with #times 368 0.020783% 99.857229%
#collisions 47 with #times 736 0.042470% 99.899699%
#collisions 48 with #times 368 0.021687% 99.921386%
#collisions 49 with #times 368 0.022139% 99.943524%
#collisions 50 with #times 920 0.056476% 100.000000%
总体观察:
- 65.3% 的输入以 27 向碰撞告终; 33路11.7%; 30 向碰撞中为 10.5%; 36 路 6.5%...
- 不到 2.1% 的输入避免了 27 向或更糟的碰撞
尽管输入仅占散列值的 1.9% space(在 2^32 中计算为 81450625,因为我们散列为 32 位值)。太惨了。
为了展示它是多么糟糕,让我们比较一下将 4 个可打印的 ASCII 字符放入 GCC std::string
s 中用于 GCC std::hash<std::string>
,我相信从记忆中使用 MURMUR32 哈希:
81450625 4-character printable ASCII combinations
#collisions 1 with #times 79921222 98.122294% 98.122294%
#collisions 2 with #times 757434 1.859860% 99.982155%
#collisions 3 with #times 4809 0.017713% 99.999867%
#collisions 4 with #times 27 0.000133% 100.000000%
所以 - 回到为什么 + 31 * previous
的问题 - 你必须与其他同样简单的散列函数进行比较,看看这是否优于生成 CPU 的平均水平哈希,尽管它在绝对意义上如此糟糕,但它是模糊可能的,但考虑到 大量 更好的哈希的额外成本非常小,我建议使用一个并忘记 "*31 " 完全。
代码:
#include <map>
#include <iostream>
#include <iomanip>
int main()
{
std::map<unsigned, unsigned> histogram;
for (int i = ' '; i <= '~'; ++i)
for (int j = ' '; j <= '~'; ++j)
for (int k = ' '; k <= '~'; ++k)
for (int l = ' '; l <= '~'; ++l)
{
unsigned hv = ((i * 31 + j) * 31 + k) * 31 + l;
/*
// use "31*" hash above OR std::hash<std::string> below...
char c[] = { i, j, k, l, '[=14=]' };
unsigned hv = std::hash<std::string>()(c); */
++histogram[hv];
}
std::map<unsigned, unsigned> histohisto;
for (auto& hv_freq : histogram)
++histohisto[hv_freq.second];
unsigned n = '~' - ' ' + 1; n *= n; n *= n;
std::cout << n << " 4-character printable ASCII combinations\n";
double cumulative_percentage = 0;
for (auto& freq_n : histohisto)
{
double percent = (double)freq_n.first * freq_n.second / n * 100;
cumulative_percentage += percent;
std::cout << "#collisions " << freq_n.first << " with #times " << freq_n.second << "\t\t" << std::fixed << percent << "% " << cumulative_percentage << "%\n";
}
}