std::unordered_map 是如何实现的

How std::unordered_map is implemented

这是我之前提出的一个问题,我发现我对 unordered_map 的实现方式有很多困惑。我相信很多其他人和我一样感到困惑。根据我没有看标准就知道的信息:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

我希望有人可以解释实现以及它如何符合 C++ 标准定义(在性能要求方面),如果它真的不是实现哈希映射数据结构的最有效方式,它怎么能有待改进?

该标准有效地强制执行 std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining,这意味着一个桶数组,每个桶都包含一个链表的头部†。这个要求很微妙:它是以下因素的结果:

  • 默认max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()
  • 保证 table 不会被重新散列,除非增长超过该负载因子。

如果没有链接,那将是不切实际的,因为与哈希的其他主要类别 table 实现的冲突 - closed hashing aka open addressing - become overwhelming as the load_factor()](https://en.cppreference.com/w/cpp/container/unordered_map/load_factor) 接近 1.

参考文献:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† 为了在不传递任何空桶的情况下实现最佳迭代,GCC 的实现将带有迭代器的桶填充到一个包含所有值的单向链表中:迭代器立即指向元素 before 桶的元素,因此如果擦除桶的最后一个值,可以重新连接 next 指针。

关于您引用的文字:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

没有“疏忽”......所做的一切都是经过深思熟虑的,并且是在充分意识到的情况下完成的。确实可以采取其他妥协措施,但开放散列/链接方法是一般用途的合理妥协,它可以相当优雅地处理来自平庸散列函数的冲突,对于小或大都不会太浪费 key/value类型,并处理任意多 insert/erase 对,而不会像许多封闭哈希实现那样逐渐降低性能。

作为意识的证据,来自Matthew Austern's proposal here

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

• It's necessary to distinguish between a vacant position and an occupied one.

• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

专门针对数据小到可以直接存储在桶中的只插入tables,一个方便的未使用桶的标记值,以及一个好的散列函数,一个封闭的散列方法可能大致是一个命令数量级更快并使用更少的内存,但这不是通用的。

散列 table 设计选项及其含义的完整比较和阐述与 S.O 无关。因为范围太广,无法在此处正确解决。