获得整数数组汉明距离的最快方法
Fastest way to get hamming distance for integer array
令 a 和 b 为具有相同大小的 8 位整数 (0-255) 的向量。我想计算这些向量不同的位数,即由这些数字的二进制表示串联形成的向量之间的汉明距离。例如:
a = [127,255]
b= [127,240]
使用 numpy 库
np.bitwise_xor(a,b)
# Output: array([ 0, 15])
我现在需要的是用二进制表示上述数组的每个元素,并计算数组所有元素中1的个数。上面的示例将给出 0+4 = 4 的汉明距离。在 Python?
中对此有任何快速而优雅的解决方案
也许不是最有效的方法,但我认为最简单的方法是将输出数组转换为二进制形式的字符串,然后将所有字符的总和转换回整数...
import numpy as np
output = np.random.randint(0,63,10)
hamming = ['{:b}'.format(x).count('1') for x in output]
方法 #1: 我们可以将它们广播成二进制位并计算不同位的数量,就像这样 -
def hamming_distance(a, b):
r = (1 << np.arange(8))[:,None]
return np.count_nonzero( (a & r) != (b & r) )
示例 运行 -
In [144]: a = [127,255]
...: b = [127,240]
...:
In [145]: hamming_distance(a, b)
Out[145]: 4
方法#2:使用bitwise-xor
操作,我们可以找出a
和b
之间不同二进制位的个数-
def hamming_distance_v2(a, b):
r = (1 << np.arange(8))[:,None]
return np.count_nonzero((np.bitwise_xor(a,b) & r) != 0)
sum(bin(x).count("1") for x in np.bitwise_xor(a,b))
如果您打算在一次程序执行期间多次调用距离函数,您可以通过使用预先计算的 table 位计数来提高速度。这是汉明距离函数的(又一个)版本:
# _nbits[k] is the number of 1s in the binary representation of k for 0 <= k < 256.
_nbits = np.array(
[0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3,
4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4,
4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5,
4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4,
5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3,
3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2,
3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5,
6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5,
5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6,
7, 7, 8], dtype=np.uint8)
def hamming_distance1(a, b):
c = np.bitwise_xor(a, b)
n = _nbits[c].sum()
return n
在下文中,a
和 b
是问题评论中给出的长度为 32 的 Python 列表。 divakar_hamming_distance()
和 divakar_hamming_distance_v2()
来自@Divakar 的回答。
@Divakar 的功能时间如下:
In [116]: %timeit divakar_hamming_distance(a, b)
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [117]: %timeit divakar_hamming_distance_v2(a, b)
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
hamming_distance1(a, b)
快一点:
In [118]: %timeit hamming_distance1(a, b)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.42 µs per loop
在我的电脑上,初始化 _nbits
大约需要 11 微秒,因此如果只调用函数一次,使用 hamming_distance1
没有任何优势。如果您调用它三次或更多次,则性能会有所提高。
如果输入已经是 numpy 数组,则所有函数都明显更快:
In [119]: aa = np.array(a)
In [120]: bb = np.array(b)
In [121]: %timeit divakar_hamming_distance_v2(aa, bb)
The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.72 µs per loop
In [122]: %timeit hamming_distance1(aa, bb)
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.77 µs per loop
当然,如果您总是在计算汉明距离之前立即执行此操作,则进行转换的时间必须包括在总时间中。但是,如果您编写生成 a
和 b
的代码以更早地利用 numpy,则在计算汉明距离时您可能已经将它们作为 numpy 数组。
(我还对 8 位值之间预先计算的汉明距离的二维数组进行了一些实验——形状为 (256, 256) 的数组——但初始化成本较高且性能提升很小.)
令 a 和 b 为具有相同大小的 8 位整数 (0-255) 的向量。我想计算这些向量不同的位数,即由这些数字的二进制表示串联形成的向量之间的汉明距离。例如:
a = [127,255]
b= [127,240]
使用 numpy 库
np.bitwise_xor(a,b)
# Output: array([ 0, 15])
我现在需要的是用二进制表示上述数组的每个元素,并计算数组所有元素中1的个数。上面的示例将给出 0+4 = 4 的汉明距离。在 Python?
中对此有任何快速而优雅的解决方案也许不是最有效的方法,但我认为最简单的方法是将输出数组转换为二进制形式的字符串,然后将所有字符的总和转换回整数...
import numpy as np
output = np.random.randint(0,63,10)
hamming = ['{:b}'.format(x).count('1') for x in output]
方法 #1: 我们可以将它们广播成二进制位并计算不同位的数量,就像这样 -
def hamming_distance(a, b):
r = (1 << np.arange(8))[:,None]
return np.count_nonzero( (a & r) != (b & r) )
示例 运行 -
In [144]: a = [127,255]
...: b = [127,240]
...:
In [145]: hamming_distance(a, b)
Out[145]: 4
方法#2:使用bitwise-xor
操作,我们可以找出a
和b
之间不同二进制位的个数-
def hamming_distance_v2(a, b):
r = (1 << np.arange(8))[:,None]
return np.count_nonzero((np.bitwise_xor(a,b) & r) != 0)
sum(bin(x).count("1") for x in np.bitwise_xor(a,b))
如果您打算在一次程序执行期间多次调用距离函数,您可以通过使用预先计算的 table 位计数来提高速度。这是汉明距离函数的(又一个)版本:
# _nbits[k] is the number of 1s in the binary representation of k for 0 <= k < 256.
_nbits = np.array(
[0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3,
4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4,
4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5,
4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4,
5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3,
3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2,
3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5,
6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5,
5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6,
7, 7, 8], dtype=np.uint8)
def hamming_distance1(a, b):
c = np.bitwise_xor(a, b)
n = _nbits[c].sum()
return n
在下文中,a
和 b
是问题评论中给出的长度为 32 的 Python 列表。 divakar_hamming_distance()
和 divakar_hamming_distance_v2()
来自@Divakar 的回答。
@Divakar 的功能时间如下:
In [116]: %timeit divakar_hamming_distance(a, b)
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [117]: %timeit divakar_hamming_distance_v2(a, b)
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
hamming_distance1(a, b)
快一点:
In [118]: %timeit hamming_distance1(a, b)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.42 µs per loop
在我的电脑上,初始化 _nbits
大约需要 11 微秒,因此如果只调用函数一次,使用 hamming_distance1
没有任何优势。如果您调用它三次或更多次,则性能会有所提高。
如果输入已经是 numpy 数组,则所有函数都明显更快:
In [119]: aa = np.array(a)
In [120]: bb = np.array(b)
In [121]: %timeit divakar_hamming_distance_v2(aa, bb)
The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.72 µs per loop
In [122]: %timeit hamming_distance1(aa, bb)
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.77 µs per loop
当然,如果您总是在计算汉明距离之前立即执行此操作,则进行转换的时间必须包括在总时间中。但是,如果您编写生成 a
和 b
的代码以更早地利用 numpy,则在计算汉明距离时您可能已经将它们作为 numpy 数组。
(我还对 8 位值之间预先计算的汉明距离的二维数组进行了一些实验——形状为 (256, 256) 的数组——但初始化成本较高且性能提升很小.)