64 位 OS/compile 的哈希函数，对于实际上只是 4 字节 int 的对象

Question

我有一个名为 Foo 的 class，它私下只不过是 4 字节的 int。如果我 return 它的值为 8 字节 size_t，我会搞砸 unordered_map<> 还是其他什么？我可以用 return foo + foo << 32; 之类的东西填充所有位。那会更好，还是更糟，因为现在所有哈希值都是 0x100000001 的倍数？或者 return ~foo + foo << 32; 会使用所有 64 位并且也没有公因数怎么样？

namespace std {
  template<> struct hash<MyNamespace::Foo> {
    typedef size_t result_type;
    typedef MyNamespace::Foo argument_tupe;
    size_t operator() (const MyNamespace::Foo& f ) const { return (size_t) f.u32InternalValue; }
  };
}

Answer 1

增量 uint32_t 密钥转换为 uint64_t 效果很好

unordered_map 将为 hash-table 增量保留 space。

key的低位用于确定bucket位置，以4entries/buckets为例，使用低位2位。具有提供相同桶（桶数的倍数）的键的元素链接在链表中。这带有load-factor.

的概念

// 4 Buckets example

******** ******** ******** ******** ******** ******** ******** ******XX

bucket 00 would contains keys like {0, 256, 200000 ...}
bucket 01 would contains keys like {1, 513, 4008001 ...}
bucket 10 would contains keys like {2, 130, 10002 ...}
bucket 11 would contains keys like {3, 259, 1027, 20003, ...}

如果您尝试在桶中保存额外的值，并且加载因子超过限制，则 table 会调整大小（例如，您尝试在 4 桶中保存第 5 个元素 table 与 load_factor=1.0).

因此：

在达到 2^32 个元素之前，使用 uint32_t 或 uint64_t 键几乎没有影响 hash-table。

Would that be better, or would it be worse as all hashes are now multiples of 0x100000001?

在达到 32 位溢出 (2^32) 之前，这不会产生任何影响 hash-table。

增量 uint32_t 和 uint64_t 之间的良好密钥转换：

key64 = static_cast<uint64>(key32);

增量 uint32_t 和 uint64_t 之间的错误密钥转换：

key64 = static_cast<uint64>(key32)<<32;

最好的办法是尽可能保持密钥均匀，避免一次又一次地使用相同的因子进行哈希。例如。在下面的代码中，所有因子为 7 的键都会发生冲突，直到大小调整为 16 个桶。

https://onlinegdb.com/r1N7TNySv

#include <iostream>
#include <unordered_map>

using namespace std;

// Print to std output the internal structure of an unordered_map.
template <typename K, typename T>
void printMapStruct(unordered_map<K, T>& map)
{
    cout << "The map has " << map.bucket_count()<< 
        " buckets and max load factor: " << map.max_load_factor() << endl;
    
    for (size_t i=0; i< map.bucket_count(); ++i)
    {
        cout << "    Bucket " << i << ": ";
        for (auto it=map.begin(i); it!=map.end(i); ++it)
        {
            cout << it->first << " ";
        }
        cout << endl;
    }
    cout << endl;
}

// Print the list of bucket sizes by this implementation
void printMapResizes()
{
    cout << "Map bucket counts:"<< endl;
    unordered_map<size_t, size_t> map;
    
    size_t lastBucketSize=0;
    for (size_t i=0; i<1024*1024; ++i)
    {
        if (lastBucketSize!=map.bucket_count())
        {
            cout << map.bucket_count() << " ";
            lastBucketSize = map.bucket_count();
        }
        map.emplace(i,i);
    }
    cout << endl;
}

int main()
{
    unordered_map<size_t,size_t> map;
    
    printMapStruct(map);
    
    map.emplace(0,0);
    map.emplace(1,1);
    printMapStruct(map);
    
    map.emplace(72,72);
    map.emplace(17,17);
    printMapStruct(map);
    
    map.emplace(7,7);
    map.emplace(14,14);
    printMapStruct(map);
    
    printMapResizes();

    return 0;
}

注意存储桶计数：

在上面的例子中，桶数如下：

1 3 7 17 37 79 167 337 709 1493 3209 6427 12983 26267 53201 107897 218971 444487 902483 1832561

这似乎是故意遵循一系列素数（尽量减少碰撞）。我不知道背后的功能。

std::unordered_map<> bucket_count() after default rehash

64 位 OS/compile 的哈希函数，对于实际上只是 4 字节 int 的对象

hash function for a 64-bit OS/compile, for an object that's really just a 4-byte int

hashcode

c++11