为稀疏 64 位无符号整数创建最小完美哈希
Create Minimum perfect hash for sparse 64-bit unsigned integer
我需要一个 64 位到 16 位的完美散列函数来处理稀疏填充的键列表。
我在 python 中有一本字典,它有 48326 个长度为 64 位的键。我想为这个键列表创建一个最小的完美散列。 (我不想等几天才能计算 MPH,所以我也可以将其映射到 16 位哈希)
objective最终将这个字典移植到C中,作为一个数组,其中包含dict值,索引是通过以key为输入的最小完美哈希函数计算的。我无法在正在构建的应用程序的 C 端口中使用外部哈希库
问题:
是否有任何 python 库将我的密钥作为输入并向我提供散列参数和(基于用于散列的定义算法)作为输出。
我找到了一个库 perfection 2.0.0,但由于我的密钥是 64 位形式,所以它挂了。 (即使我在 2000 个键的子集上测试它)
编辑
正如评论中所建议的,我查看了 Steve Hanov's Algo and modified the hash function to take a 64 bit integer (changing values of the FNV Prime and offset as per this wiki page)
虽然我得到了结果,但不幸的是 Maps return -ve 索引值虽然我可以让它工作,但这意味着我必须通过检查 -ve 索引 [= 来向散列计算添加另外 4 个周期12=]
想避免这种情况
就个人而言,我会用 gperf
, or for a large number of keys, with CMPH 生成一个 table,然后完成它。
如果你必须在 Python 中执行此操作,那么我发现 this blog post 和一些 Python 2 代码可以非常有效地转换 string 使用中间人 table 将密钥转换为最小的完美哈希。
根据您的要求调整 post 中的代码,在 0.35 秒内生成 50k 项目的最小完美散列:
>>> import random
>>> testdata = {random.randrange(2**64): random.randrange(2**64)
... for __ in range(50000)} # 50k random 64-bit keys
>>> import timeit
>>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import gen_minimal_perfect_hash, testdata', number=10)
3.461486832005903
我所做的修改:
- 我切换到 Python 3,遵循 Python 风格指南并使代码更 Pythonic
- 我正在使用
int.to_bytes()
将 64 位无符号整数转换为 8 字节字节串
- 我没有存储负数,而是使用一个标志来区分中间值中的直接输入值和哈希输入值 table
适配代码:
# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
# Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
from itertools import count, groupby
def fnv_hash_int(value, size, d=0x811c9dc5):
"""Calculates a distinct hash function for a given 64-bit integer.
Each value of the integer d results in a different hash value. The return
value is the modulus of the hash and size.
"""
# Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
# The unsigned integer is first converted to a 8-character byte string.
for c in value.to_bytes(8, 'big'):
d = ((d * 0x01000193) ^ c) & 0xffffffff
return d % size
def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
"""Computes a minimal perfect hash table using the given Python dictionary.
It returns a tuple (intermediate, values). intermediate and values are both
lists. intermediate contains the intermediate table of indices needed to
compute the index of the value in values; a tuple of (flag, d) is stored, where
d is either a direct index, or the input for another call to the hash function.
values contains the values of the dictionary.
"""
size = len(dictionary)
# Step 1: Place all of the keys into buckets
buckets = [[] for __ in dictionary]
intermediate = [(False, 0)] * size
values = [None] * size
for key in dictionary:
buckets[_hash_func(key, size)].append(key)
# Step 2: Sort the buckets and process the ones with the most items first.
buckets.sort(key=len, reverse=True)
# Only look at buckets of length greater than 1 first; partitioned produces
# groups of buckets of lengths > 1, then those of length 1, then the empty
# buckets (we ignore the last group).
partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
for bucket in next(partitioned, ()):
# Try increasing values of d until we find a hash function
# that places all items in this bucket into free slots
for d in count(1):
slots = {}
for key in bucket:
slot = _hash_func(key, size, d=d)
if values[slot] is not None or slot in slots:
break
slots[slot] = dictionary[key]
else:
# all slots filled, update the values table; False indicates
# these values are inputs into the hash function
intermediate[_hash_func(bucket[0], size)] = (False, d)
for slot, value in slots.items():
values[slot] = value
break
# The next group is buckets with only 1 item. Process them more quickly by
# directly placing them into a free slot.
freelist = (i for i, value in enumerate(values) if value is None)
for bucket, slot in zip(next(partitioned, ()), freelist):
# These are 'direct' slot references
intermediate[_hash_func(bucket[0], size)] = (True, slot)
values[slot] = dictionary[bucket[0]]
return (intermediate, values)
def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
"Look up a value in the hash table defined by intermediate and values"
direct, d = intermediate[_hash_func(key, len(intermediate))]
return values[d if direct else _hash_func(key, len(values), d=d)]
上面生成了两个列表,每个列表有 50k 个条目;第一个 table 中的值是 (boolean, integer)
元组,整数在 [0, tablesize)
范围内(理论上值的范围可达 2^16 但如果它曾经出现我会感到非常惊讶进行了 65k+ 次尝试来为您的数据找到一个插槽安排)。您的 table 大小小于 50k,因此上述安排可以将条目存储在该列表中的 4 个字节中(bool
和 short
为 3,但 alignment rules 添加一个字节填充)将其表示为 C 数组时。
快速测试以查看散列 table 是否正确并再次生成正确的输出:
>>> tables = gen_minimal_perfect_hash(testdata)
>>> for key, value in testdata.items():
... assert perfect_hash_lookup(key, *tables) == value
...
只需要在C:
中实现查找功能即可
fnv_hash_int
操作可以获取一个指向 64 位整数的指针,然后将该指针转换为一个 8 位值的数组,并将索引递增 8 次以访问每个单独的字节;使用 suitable function to ensure big-endian (network) order.
- 您不需要在 C 语言中使用
0xffffffff
进行屏蔽,因为 C 整数值的溢出无论如何都会被自动丢弃。
len(intermediate) == len(values) == len(dictionary)
并且可以在常量中捕获。
- 假设 C99,将中间 table 存储为结构类型数组,其中
flag
为 bool
,d
为无符号 short
;这只是 3 个字节,加上 1 个填充字节以在 4 字节边界上对齐。 values
数组中的数据类型取决于输入字典中的值。
如果你原谅我的 C 技能,这里有一个示例实现:
mph_table.h
#include "mph_generated_table.h"
#include <arpa/inet.h>
#include <stdint.h>
#ifndef htonll
// see
#define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
#endif
uint64_t mph_lookup(uint64_t key);
mph_table.c
#include "mph_table.h"
#include <stdbool.h>
#include <stdint.h>
#define FNV_OFFSET 0x811c9dc5
#define FNV_PRIME 0x01000193
uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
d = (d == 0) ? FNV_OFFSET : d;
uint8_t* keybytes = (uint8_t*)&key;
for (int i = 0; i < 8; ++i) {
d = (d * FNV_PRIME) ^ keybytes[i];
}
return d % TABLE_SIZE;
}
uint64_t mph_lookup(uint64_t key) {
_intermediate_entry entry =
mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
return mph_tables.values[
entry.flag ?
entry.d :
fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
}
这将依赖于生成的头文件,生成自:
from textwrap import indent
template = """\
#include <stdbool.h>
#include <stdint.h>
#define TABLE_SIZE %(size)s
typedef struct _intermediate_entry {
bool flag;
uint16_t d;
} _intermediate_entry;
typedef struct mph_tables_t {
_intermediate_entry intermediate[TABLE_SIZE];
uint64_t values[TABLE_SIZE];
} mph_tables_t;
static const mph_tables_t mph_tables = {
{ // intermediate
%(intermediate)s
},
{ // values
%(values)s
}
};
"""
tables = gen_minimal_perfect_hash(dictionary)
size = len(dictionary)
cbool = ['false, ', 'true, ']
perline = lambda i: zip(*([i] * 10))
entries = (f'{{{cbool[e[0]]}{e[1]:#06x}}}' for e in tables[0])
intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
entries = (format(v, '#018x') for v in tables[1])
values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
with open('mph_generated_table.h', 'w') as generated:
generated.write(template % locals())
其中 dictionary
是您的输入 table。
使用 gcc -O3
编译哈希函数是内联的(循环展开)并且整个 mph_lookup
函数以 300 CPU 条指令计时。循环遍历我生成的所有 50k 个随机密钥的快速基准测试显示我的 2.9 GHz Intel Core i7 笔记本电脑每秒可以为这些密钥查找 5000 万个值(每个密钥 0.02 微秒)。
我需要一个 64 位到 16 位的完美散列函数来处理稀疏填充的键列表。
我在 python 中有一本字典,它有 48326 个长度为 64 位的键。我想为这个键列表创建一个最小的完美散列。 (我不想等几天才能计算 MPH,所以我也可以将其映射到 16 位哈希)
objective最终将这个字典移植到C中,作为一个数组,其中包含dict值,索引是通过以key为输入的最小完美哈希函数计算的。我无法在正在构建的应用程序的 C 端口中使用外部哈希库
问题: 是否有任何 python 库将我的密钥作为输入并向我提供散列参数和(基于用于散列的定义算法)作为输出。
我找到了一个库 perfection 2.0.0,但由于我的密钥是 64 位形式,所以它挂了。 (即使我在 2000 个键的子集上测试它)
编辑 正如评论中所建议的,我查看了 Steve Hanov's Algo and modified the hash function to take a 64 bit integer (changing values of the FNV Prime and offset as per this wiki page)
虽然我得到了结果,但不幸的是 Maps return -ve 索引值虽然我可以让它工作,但这意味着我必须通过检查 -ve 索引 [= 来向散列计算添加另外 4 个周期12=]
想避免这种情况
就个人而言,我会用 gperf
, or for a large number of keys, with CMPH 生成一个 table,然后完成它。
如果你必须在 Python 中执行此操作,那么我发现 this blog post 和一些 Python 2 代码可以非常有效地转换 string 使用中间人 table 将密钥转换为最小的完美哈希。
根据您的要求调整 post 中的代码,在 0.35 秒内生成 50k 项目的最小完美散列:
>>> import random
>>> testdata = {random.randrange(2**64): random.randrange(2**64)
... for __ in range(50000)} # 50k random 64-bit keys
>>> import timeit
>>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import gen_minimal_perfect_hash, testdata', number=10)
3.461486832005903
我所做的修改:
- 我切换到 Python 3,遵循 Python 风格指南并使代码更 Pythonic
- 我正在使用
int.to_bytes()
将 64 位无符号整数转换为 8 字节字节串
- 我没有存储负数,而是使用一个标志来区分中间值中的直接输入值和哈希输入值 table
适配代码:
# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
# Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
from itertools import count, groupby
def fnv_hash_int(value, size, d=0x811c9dc5):
"""Calculates a distinct hash function for a given 64-bit integer.
Each value of the integer d results in a different hash value. The return
value is the modulus of the hash and size.
"""
# Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
# The unsigned integer is first converted to a 8-character byte string.
for c in value.to_bytes(8, 'big'):
d = ((d * 0x01000193) ^ c) & 0xffffffff
return d % size
def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
"""Computes a minimal perfect hash table using the given Python dictionary.
It returns a tuple (intermediate, values). intermediate and values are both
lists. intermediate contains the intermediate table of indices needed to
compute the index of the value in values; a tuple of (flag, d) is stored, where
d is either a direct index, or the input for another call to the hash function.
values contains the values of the dictionary.
"""
size = len(dictionary)
# Step 1: Place all of the keys into buckets
buckets = [[] for __ in dictionary]
intermediate = [(False, 0)] * size
values = [None] * size
for key in dictionary:
buckets[_hash_func(key, size)].append(key)
# Step 2: Sort the buckets and process the ones with the most items first.
buckets.sort(key=len, reverse=True)
# Only look at buckets of length greater than 1 first; partitioned produces
# groups of buckets of lengths > 1, then those of length 1, then the empty
# buckets (we ignore the last group).
partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
for bucket in next(partitioned, ()):
# Try increasing values of d until we find a hash function
# that places all items in this bucket into free slots
for d in count(1):
slots = {}
for key in bucket:
slot = _hash_func(key, size, d=d)
if values[slot] is not None or slot in slots:
break
slots[slot] = dictionary[key]
else:
# all slots filled, update the values table; False indicates
# these values are inputs into the hash function
intermediate[_hash_func(bucket[0], size)] = (False, d)
for slot, value in slots.items():
values[slot] = value
break
# The next group is buckets with only 1 item. Process them more quickly by
# directly placing them into a free slot.
freelist = (i for i, value in enumerate(values) if value is None)
for bucket, slot in zip(next(partitioned, ()), freelist):
# These are 'direct' slot references
intermediate[_hash_func(bucket[0], size)] = (True, slot)
values[slot] = dictionary[bucket[0]]
return (intermediate, values)
def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
"Look up a value in the hash table defined by intermediate and values"
direct, d = intermediate[_hash_func(key, len(intermediate))]
return values[d if direct else _hash_func(key, len(values), d=d)]
上面生成了两个列表,每个列表有 50k 个条目;第一个 table 中的值是 (boolean, integer)
元组,整数在 [0, tablesize)
范围内(理论上值的范围可达 2^16 但如果它曾经出现我会感到非常惊讶进行了 65k+ 次尝试来为您的数据找到一个插槽安排)。您的 table 大小小于 50k,因此上述安排可以将条目存储在该列表中的 4 个字节中(bool
和 short
为 3,但 alignment rules 添加一个字节填充)将其表示为 C 数组时。
快速测试以查看散列 table 是否正确并再次生成正确的输出:
>>> tables = gen_minimal_perfect_hash(testdata)
>>> for key, value in testdata.items():
... assert perfect_hash_lookup(key, *tables) == value
...
只需要在C:
中实现查找功能即可fnv_hash_int
操作可以获取一个指向 64 位整数的指针,然后将该指针转换为一个 8 位值的数组,并将索引递增 8 次以访问每个单独的字节;使用 suitable function to ensure big-endian (network) order.- 您不需要在 C 语言中使用
0xffffffff
进行屏蔽,因为 C 整数值的溢出无论如何都会被自动丢弃。 len(intermediate) == len(values) == len(dictionary)
并且可以在常量中捕获。- 假设 C99,将中间 table 存储为结构类型数组,其中
flag
为bool
,d
为无符号short
;这只是 3 个字节,加上 1 个填充字节以在 4 字节边界上对齐。values
数组中的数据类型取决于输入字典中的值。
如果你原谅我的 C 技能,这里有一个示例实现:
mph_table.h
#include "mph_generated_table.h"
#include <arpa/inet.h>
#include <stdint.h>
#ifndef htonll
// see
#define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
#endif
uint64_t mph_lookup(uint64_t key);
mph_table.c
#include "mph_table.h"
#include <stdbool.h>
#include <stdint.h>
#define FNV_OFFSET 0x811c9dc5
#define FNV_PRIME 0x01000193
uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
d = (d == 0) ? FNV_OFFSET : d;
uint8_t* keybytes = (uint8_t*)&key;
for (int i = 0; i < 8; ++i) {
d = (d * FNV_PRIME) ^ keybytes[i];
}
return d % TABLE_SIZE;
}
uint64_t mph_lookup(uint64_t key) {
_intermediate_entry entry =
mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
return mph_tables.values[
entry.flag ?
entry.d :
fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
}
这将依赖于生成的头文件,生成自:
from textwrap import indent
template = """\
#include <stdbool.h>
#include <stdint.h>
#define TABLE_SIZE %(size)s
typedef struct _intermediate_entry {
bool flag;
uint16_t d;
} _intermediate_entry;
typedef struct mph_tables_t {
_intermediate_entry intermediate[TABLE_SIZE];
uint64_t values[TABLE_SIZE];
} mph_tables_t;
static const mph_tables_t mph_tables = {
{ // intermediate
%(intermediate)s
},
{ // values
%(values)s
}
};
"""
tables = gen_minimal_perfect_hash(dictionary)
size = len(dictionary)
cbool = ['false, ', 'true, ']
perline = lambda i: zip(*([i] * 10))
entries = (f'{{{cbool[e[0]]}{e[1]:#06x}}}' for e in tables[0])
intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
entries = (format(v, '#018x') for v in tables[1])
values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
with open('mph_generated_table.h', 'w') as generated:
generated.write(template % locals())
其中 dictionary
是您的输入 table。
使用 gcc -O3
编译哈希函数是内联的(循环展开)并且整个 mph_lookup
函数以 300 CPU 条指令计时。循环遍历我生成的所有 50k 个随机密钥的快速基准测试显示我的 2.9 GHz Intel Core i7 笔记本电脑每秒可以为这些密钥查找 5000 万个值(每个密钥 0.02 微秒)。