为什么以下三个字符串的哈希码相同?

Why the following three strings's hashcode are same?

看了JDK的源码后,还是很惊讶,字符串 "AaAa", "AaBB" and "BBBB" 具有相同的哈希码。

JDK来源如下,

int h = hash;
if (h == 0 && value.length > 0) {
    char val[] = value;

    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;
}
return h;

谁能澄清一下?

因为这就是 the hash code is defined to be calculated for a String:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

所以:

  • 对于AaAa65*31^3 + 97*31^2 + 65*31 + 97 = 2031744
  • 对于AaBB65*31^3 + 97*31^2 + 66*31 + 66 = 2031744
  • BBBB66*31^3 + 66*31^2 + 66*31 + 66 = 2031744

他们的哈希码是

AaAa: ((65 * 31 + 97) * 31 + 65) * 31 + 97 = 2.031.744
AaBB: ((65 * 31 + 97) * 31 + 66) * 31 + 66 = 2.031.744
BBBB: ((66 * 31 + 66) * 31 + 66) * 31 + 66 = 2.031.744

数学就是这样,没什么好混淆的。
请注意 97 和 66 之间正好有 31 的差异,这就是使这些哈希码排列得如此漂亮的原因。

因为probability.

大约有 40 亿种可能的哈希码 (Integer.MIN_VALUE -> Integer.MAX_VALUE) 和基本上无限可能的字符串。必然有 collisions. In fact, the birthday problem shows us that only ~77,000 strings 是任意碰撞的高可能性所必需的 - 如果哈希函数具有极高的熵,则它不会。

也许您正在考虑 cryptographic hash function,其中

a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

在这种情况下,Object.hashCode 不是为加密目的而设计的。

另见 How secure is Java's hashCode()?

这里是Java文档中Object#hashCode方法的描述:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.This integer need not remain consistent from one execution of an application to another execution of the same application.

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

It is not required that if two objects are unequal according to the java.lang.Object#equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

所以,执行Stringclass也维持上面characteristics.So这是正常现象。

有多种具有不同设计和性能标准的哈希函数。

  1. 用于索引的散列函数(例如关联数组和类似用法)可能会频繁发生冲突,但没有问题,因为散列 table 代码随后会在某些命名器中进行处理,例如将它们放入列表或重新散列。这一切都与时间的表现有关。 Javahash()好像是这种

  2. 另一种函数,如 SHA* 等加密散列,以牺牲散列性能为代价努力避免冲突。

  3. 然而,第三种类型的哈希函数是密码验证器哈希,其设计速度非常慢(~100 毫秒很常见)并且可能需要大量内存并且不经常发生冲突不用担心。这里的重点是使暴力攻击花费的时间尽可能长,以至于不可行。

Once 根据用途选择哈希的类型和特征。