字典和哈希表 space 复杂度

Question

我对字典和哈希表有些困惑，我想澄清一下。假设我有当前字典和当前 python 运行.[=14 的哈希的当前输出=]

Dict = dict()
print(hash('a'))
print(hash('b'))
print(hash('c'))
Dict['a'] = 1
Dict['b'] = 2
Dict['c'] = 3
print(Dict)

具有

的输出

1714333803
1519074822
1245896149
{'a': 1, 'c': 3, 'b': 2}

据我所知，哈希表只是一个数组，其中哈希是哈希表的索引。例如，'a' 的哈希值为 1714333803，因此我的哈希表索引 1714333803 的值为 'a'。所以我很困惑哈希表有多少个索引以及哈希函数如何产生答案？它是否使用模数并具有固定范围的索引？因为字典的给定打印输出 {'a': 1, 'c': 3, 'b': 2}，但是假设即使它输出那个，字典实际上是至少 1714333803 个索引的数组是正确的，因为包含 3 个元素似乎太过分了，更不用说了这是多么浪费 space。同样对于哈希表，索引中有什么没有值，null?

Answer 1

dict 的实际大小取决于实现，但在您的情况下，它可能是 8。那么，这是如何工作的？

dict（或一般的哈希映射）的工作原理是为每个键计算一个数字哈希。例如，在您的情况下是 hash("a") == 1714333803。现在，散列不直接用作索引。相反，它被映射到字典的大小。

一个简单的方法是取模 (%)。假设您的 dict 大小为 8。然后hash("a") % 8 == 1714333803 % 8 == 3。所以你的项目实际上在第 4 位。通过构造查找算法，任何项目都不能在数组之外有索引。

这里有一些更复杂的事情，比如哈希冲突。例如，如果另一个项目具有散列 98499，则也映射到 3。在这种情况下，有一些冲突解决策略可以选择不同的索引。他们大多试图统一大步走阵列。

那么，为什么您的 dict 是 8 码？因为那是 default size in python. Once your dict get's too small, it must be resized. In contrast to arrays, this is done before the dict is actually full - namely, at two thirds filling. This is done to reduce hash collisions - if your dict is 99% full, a collision is practically guaranteed. For a size 8 dict, you'd have to enter 5-6 items before it resizes, namely doubles its capacity 到 16.

请注意 CPython 3.6+ and PyPy (for a long time) use a two-stage data structure 代表 dict。第一阶段是一个散列table，但第二阶段不是。这将键映射（第一阶段）和数据存储（第二阶段）分开。稀疏的第一阶段为紧凑的第二阶段提供索引：

# based on Raymond Hettingers mail on python-dev
# the key mapping, using a hashtable
# indices[hash(key) % length] => data index
indices =  [None, None, None, 0, None, 2, 1, None]

# the data storage, packed in insertion order
# entries[index] => hash(key), key, value
entries =  [[1714333803, 'a', 1],
            [1519074822, 'b', 2],
            [1245896149, 'c', 3]]

此方案在算法上对于查找（由于间接）更复杂，但对于迭代（直接在数据存储上）则不那么复杂，并且内存效率更高。只有索引 table 是稀疏的，需要超大。除非项目被删除，否则数据存储量恰好与需要的一样大。

字典和哈希表 space 复杂度

Dictionaries and Hashtables space complexity

python

dictionary

space

hashtable