为稀疏 64 位无符号整数创建最小完美哈希

Create Minimum perfect hash for sparse 64-bit unsigned integer

我需要一个 64 位到 16 位的完美散列函数来处理稀疏填充的键列表。

我在 python 中有一本字典,它有 48326 个长度为 64 位的键。我想为这个键列表创建一个最小的完美散列。 (我不想等几天才能计算 MPH,所以我也可以将其映射到 16 位哈希)

objective最终将这个字典移植到C中,作为一个数组,其中包含dict值,索引是通过以key为输入的最小完美哈希函数计算的。我无法在正在构建的应用程序的 C 端口中使用外部哈希库

问题: 是否有任何 python 库将我的密钥作为输入并向我提供散列参数和(基于用于散列的定义算法)作为输出。

我找到了一个库 perfection 2.0.0,但由于我的密钥是 64 位形式,所以它挂了。 (即使我在 2000 个键的子集上测试它)

编辑 正如评论中所建议的,我查看了 Steve Hanov's Algo and modified the hash function to take a 64 bit integer (changing values of the FNV Prime and offset as per this wiki page)

虽然我得到了结果,但不幸的是 Maps return -ve 索引值虽然我可以让它工作,但这意味着我必须通过检查 -ve 索引 [= 来向散列计算添加另外 4 个周期12=]

想避免这种情况

就个人而言,我会用 gperf, or for a large number of keys, with CMPH 生成一个 table,然后完成它。

如果你必须在 Python 中执行此操作,那么我发现 this blog post 和一些 Python 2 代码可以非常有效地转换 string 使用中间人 table 将密钥转换为最小的完美哈希。

根据您的要求调整 post 中的代码,在 0.35 秒内生成 50k 项目的最小完美散列:

>>> import random
>>> testdata = {random.randrange(2**64): random.randrange(2**64)
...             for __ in range(50000)}  # 50k random 64-bit keys
>>> import timeit
>>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import  gen_minimal_perfect_hash, testdata', number=10)
3.461486832005903

我所做的修改:

  • 我切换到 Python 3,遵循 Python 风格指南并使代码更 Pythonic
  • 我正在使用 int.to_bytes()
  • 将 64 位无符号整数转换为 8 字节字节串
  • 我没有存储负数,而是使用一个标志来区分中间值中的直接输入值和哈希输入值 table

适配代码:

# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
# Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
from itertools import count, groupby


def fnv_hash_int(value, size, d=0x811c9dc5):
    """Calculates a distinct hash function for a given 64-bit integer.

    Each value of the integer d results in a different hash value. The return
    value is the modulus of the hash and size.

    """
    # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
    # The unsigned integer is first converted to a 8-character byte string.
    for c in value.to_bytes(8, 'big'):
        d = ((d * 0x01000193) ^ c) & 0xffffffff

    return d % size


def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
    """Computes a minimal perfect hash table using the given Python dictionary.

    It returns a tuple (intermediate, values). intermediate and values are both
    lists. intermediate contains the intermediate table of indices needed to
    compute the index of the value in values; a tuple of (flag, d) is stored, where
    d is either a direct index, or the input for another call to the hash function.
    values contains the values of the dictionary.

    """
    size = len(dictionary)

    # Step 1: Place all of the keys into buckets
    buckets = [[] for __ in dictionary]
    intermediate = [(False, 0)] * size
    values = [None] * size

    for key in dictionary:
        buckets[_hash_func(key, size)].append(key)

    # Step 2: Sort the buckets and process the ones with the most items first.
    buckets.sort(key=len, reverse=True)
    # Only look at buckets of length greater than 1 first; partitioned produces
    # groups of buckets of lengths > 1, then those of length 1, then the empty
    # buckets (we ignore the last group).
    partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
    for bucket in next(partitioned, ()):
        # Try increasing values of d until we find a hash function
        # that places all items in this bucket into free slots
        for d in count(1):
            slots = {}
            for key in bucket:
                slot = _hash_func(key, size, d=d)
                if values[slot] is not None or slot in slots:
                    break
                slots[slot] = dictionary[key]
            else:
                # all slots filled, update the values table; False indicates
                # these values are inputs into the hash function
                intermediate[_hash_func(bucket[0], size)] = (False, d)
                for slot, value in slots.items():
                    values[slot] = value
                break

    # The next group is buckets with only 1 item. Process them more quickly by
    # directly placing them into a free slot.
    freelist = (i for i, value in enumerate(values) if value is None)

    for bucket, slot in zip(next(partitioned, ()), freelist):
        # These are 'direct' slot references
        intermediate[_hash_func(bucket[0], size)] = (True, slot)
        values[slot] = dictionary[bucket[0]]

    return (intermediate, values)


def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
    "Look up a value in the hash table defined by intermediate and values"
    direct, d = intermediate[_hash_func(key, len(intermediate))]
    return values[d if direct else _hash_func(key, len(values), d=d)]

上面生成了两个列表,每个列表有 50k 个条目;第一个 table 中的值是 (boolean, integer) 元组,整数在 [0, tablesize) 范围内(理论上值的范围可达 2^16 但如果它曾经出现我会感到非常惊讶进行了 65k+ 次尝试来为您的数据找到一个插槽安排)。您的 table 大小小于 50k,因此上述安排可以将条目存储在该列表中的 4 个字节中(boolshort 为 3,但 alignment rules 添加一个字节填充)将其表示为 C 数组时。

快速测试以查看散列 table 是否正确并再次生成正确的输出:

>>> tables = gen_minimal_perfect_hash(testdata)
>>> for key, value in testdata.items():
...     assert perfect_hash_lookup(key, *tables) == value
...

只需要在C:

中实现查找功能即可
  • fnv_hash_int 操作可以获取一个指向 64 位整数的指针,然后将该指针转换为一个 8 位值的数组,并将索引递增 8 次以访问每个单独的字节;使用 suitable function to ensure big-endian (network) order.
  • 您不需要在 C 语言中使用 0xffffffff 进行屏蔽,因为 C 整数值的溢出无论如何都会被自动丢弃。
  • len(intermediate) == len(values) == len(dictionary)并且可以在常量中捕获。
  • 假设 C99,将中间 table 存储为结构类型数组,其中 flagboold 为无符号 short ;这只是 3 个字节,加上 1 个填充字节以在 4 字节边界上对齐。 values 数组中的数据类型取决于输入字典中的值。

如果你原谅我的 C 技能,这里有一个示例实现:

mph_table.h

#include "mph_generated_table.h"
#include <arpa/inet.h>
#include <stdint.h>

#ifndef htonll
// see 
#define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
#endif

uint64_t mph_lookup(uint64_t key);

mph_table.c

#include "mph_table.h"
#include <stdbool.h>
#include <stdint.h>

#define FNV_OFFSET 0x811c9dc5
#define FNV_PRIME 0x01000193

uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
    d = (d == 0) ? FNV_OFFSET : d;
    uint8_t* keybytes = (uint8_t*)&key;
    for (int i = 0; i < 8; ++i) {
        d = (d * FNV_PRIME) ^ keybytes[i];
    }
    return d % TABLE_SIZE;
}

uint64_t mph_lookup(uint64_t key) {
    _intermediate_entry entry = 
        mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
    return mph_tables.values[
        entry.flag ?
            entry.d : 
            fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
}

这将依赖于生成的头文件,生成自:

from textwrap import indent

template = """\
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE %(size)s

typedef struct _intermediate_entry {
    bool flag;
    uint16_t d;
} _intermediate_entry;
typedef struct mph_tables_t {
    _intermediate_entry intermediate[TABLE_SIZE];
    uint64_t values[TABLE_SIZE];
} mph_tables_t;

static const mph_tables_t mph_tables = {
    {  // intermediate
%(intermediate)s
    },
    {  // values
%(values)s
    }
};
"""

tables = gen_minimal_perfect_hash(dictionary)
size = len(dictionary)
cbool = ['false, ', 'true,  ']
perline = lambda i: zip(*([i] * 10))
entries = (f'{{{cbool[e[0]]}{e[1]:#06x}}}' for e in tables[0])
intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
entries = (format(v, '#018x') for v in tables[1])
values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)

with open('mph_generated_table.h', 'w') as generated:
    generated.write(template % locals())

其中 dictionary 是您的输入 table。

使用 gcc -O3 编译哈希函数是内联的(循环展开)并且整个 mph_lookup 函数以 300 CPU 条指令计时。循环遍历我生成的所有 50k 个随机密钥的快速基准测试显示我的 2.9 GHz Intel Core i7 笔记本电脑每秒可以为这些密钥查找 5000 万个值(每个密钥 0.02 微秒)。