How to binarize the columns of a numpy array?

Question

I want to binary-encode (not one-hot encode) the columns of a numpy array:

import numpy as np

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)

Output:

>>> print("np.unpackbits(a,axis=1):\n{}".format(np.unpackbits(a, axis=1)))
np.unpackbits(a,axis=1):
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1]]

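This is very close to what I need, but the output array contains redundant zeros that I do not want.

For example, the first column, [2, 4, 3], is encoded in the output as [0 0 0 0 0 0 1 0, 0 0 0 0 0 1 0 0, 0 0 0 0 0 0 1 1]. Encoding it as [0 1 0, 1 0 0, 0 1 1] would be enough.

Is there any ready-made code or sklearn/numpy module that can convert the array above into a binary representation without the redundant zeros?

Thanks in advance!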

Answer 1

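Nothing quite that nice exists off the shelf. If you are willing to accept a small number of extra bits (we want the bit width to be a factor of 8), you can get what you want with minimal extra effort.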

# Extract the even-numbered columns.
# Take each uint8 of the form 0000abcd and transform it into abcd0000.
x = a[:,::2]<<4

# The if-statement handles a shape mismatch for odd column counts.
# Take each of the abcd0000 uint8s we just created and add an adjacent
# 0000efgh value to it to get abcdefgh.
if a.shape[1]&1:
    x[:,:-1] += a[:,1::2]
else:
    x += a[:,1::2]

# As long as every element was small enough that there wasn't overflow
# (each value must fit in 4 bits, i.e. be less than 16), we just shrank
# the array by half. Unpack it as before.
np.unpackbits(x, axis=1)
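
The snippet above silently produces wrong bits if any element needs more than 4 bits. A minimal sketch of the same nibble-packing idea with that precondition checked explicitly follows; `pack_nibbles` is a hypothetical helper name, not a NumPy function, and it assumes a 2-D `uint8` input.

```python
import numpy as np

def pack_nibbles(a):
    """Pack adjacent pairs of 4-bit values into single bytes (hypothetical helper)."""
    a = np.asarray(a, dtype=np.uint8)
    if a.size and a.max() > 0x0F:
        raise ValueError("every element must fit in 4 bits (be less than 16)")
    # Even-numbered columns become the high nibble: 0000abcd -> abcd0000.
    x = a[:, ::2] << 4
    # Odd-numbered columns fill the low nibble. With an odd column count the
    # last packed byte has no partner and keeps a zero low nibble.
    odd = a[:, 1::2]
    x[:, :odd.shape[1]] |= odd
    return x

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)
packed = pack_nibbles(a)
bits = np.unpackbits(packed, axis=1)  # shape (3, 16): 12 data bits + 4 padding bits per row
```

Slicing `x` by the number of odd columns replaces the original `if`/`else` branch, since the slice covers all of `x` when the column count is even.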

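Using the minimum number of bits is slightly harder, for a couple of reasons:

You need to compute the minimum bit width. That is just int(a.max()).bit_length(), though, so do not worry too much about this step.
You need to pack each 8-bit integer into a new B-bit lane. If B is not a factor of 8, that is fairly painful, because a single input integer will span multiple output integers.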
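
One way to sidestep the lane-spanning bookkeeping is to go through an intermediate bit array and let `np.packbits` repack it. The sketch below is only an illustration of the two steps just listed: `pack_lanes` is a hypothetical helper name, and it assumes a non-empty 2-D `uint8` array with at least one nonzero element.

```python
import numpy as np

def pack_lanes(a):
    """Pack each row into B-bit lanes, B being the minimum width (hypothetical helper)."""
    a = np.asarray(a, dtype=np.uint8)
    # Step 1: minimum bit width for the largest value.
    B = int(a.max()).bit_length()
    # Step 2: expand each value to 8 bits on a new axis and keep the low B bits.
    lanes = np.unpackbits(a[:, :, np.newaxis], axis=2)[:, :, 8 - B:]
    # Join the lanes of each row and repack into bytes. np.packbits pads the
    # last byte of each row with zeros, so lanes may straddle byte boundaries.
    packed = np.packbits(lanes.reshape(a.shape[0], -1), axis=1)
    return packed, B

# B is 3 here, so the third value of each row spans two output bytes.
packed, B = pack_lanes(np.array([[2, 3, 5], [4, 6, 1]], dtype=np.uint8))
```

This trades memory for simplicity: the intermediate array holds one byte per bit, which is exactly the overhead the nibble trick avoids.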

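As far as other solutions go, you could always accept a large runtime overhead and build the unpacked array yourself. The following code is definitely not optimal, but it should be simple enough to illustrate the idea.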

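def bits(x, n):
    # Return the n lowest bits of x, most significant bit first.
    def _foo(x):
        for _ in range(n):
            yield x&1
            x >>= 1
    # reversed() needs a sequence, not a generator, so build a list first.
    return reversed(list(_foo(x)))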

def unpack(L, n):
    for x in L:
        yield from bits(x, n)

width = int(a.max()).bit_length()
result = np.array([list(unpack(L, width)) for L in a], dtype=bool)
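
The same shift-and-mask idea can be written without Python-level loops by broadcasting the shifts over a new trailing axis. This is a sketch under the assumption that the input is the 2-D `uint8` array from the question; the variable names are illustrative.

```python
import numpy as np

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)

width = int(a.max()).bit_length()
# Shift amounts from the most significant kept bit down to bit 0.
shifts = np.arange(width - 1, -1, -1, dtype=np.uint8)
# Broadcasting: (rows, cols, 1) >> (width,) -> (rows, cols, width).
bits = (a[:, :, np.newaxis] >> shifts) & 1
# Flatten the per-value bits back into one row per input row.
result = bits.reshape(a.shape[0], -1).astype(bool)
```

For this array `result` has shape `(3, 12)`, with four bits per input value.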

Answer 2

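If you want to keep the numpy array structure, you have to truncate all unpacked numbers to the same number of bits. You can determine the required number of bits as follows: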

>>> a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)
>>> nbits = int(np.floor(np.log2(np.max(a)))+1)
>>> nbits
4
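
The `log2` formula works for this array but fails when the maximum is 0, because `np.log2(0)` is `-inf` and cannot be converted to `int`. An integer-only alternative is sketched below; `min_bits` is a hypothetical helper name introduced for illustration.

```python
import numpy as np

def min_bits(a):
    """Number of bits needed for the largest value in a (hypothetical helper)."""
    # int.bit_length() is exact integer arithmetic and returns 0 for 0.
    return int(np.max(a)).bit_length()

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)
nbits = min_bits(a)
nbits_zero = min_bits(np.zeros((2, 2), dtype=np.uint8))
```

Here `nbits` matches the value of 4 computed above, and the all-zero array yields 0 instead of raising.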

You can build a version of np.unpackbits truncated from 8 to nbits bits by first unpacking along a new axis, truncating, and then reshaping to the shape you want:

>>> np.unpackbits(a[...,np.newaxis], axis=2)[:,:,8-nbits:].reshape(a.shape[0],-1)
array([[0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]], dtype=uint8)

Another way to get this result is to filter the columns of the np.unpackbits result:

>>> u = np.unpackbits(a, axis=1)
>>> u[:,[i for i in range(u.shape[1]) if i%8>=8-nbits]]
array([[0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]], dtype=uint8)
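
The list comprehension can also be expressed as a boolean mask over the column index, which avoids the Python-level loop. A small sketch, assuming the same `a` and a single shared `nbits`:

```python
import numpy as np

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)
nbits = int(a.max()).bit_length()

u = np.unpackbits(a, axis=1)
# Position of each unpacked column inside its byte (0 = most significant bit).
pos = np.arange(u.shape[1]) % 8
# Keep only the nbits least significant positions of every byte.
keep = pos >= 8 - nbits
trimmed = u[:, keep]
```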

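Last but not least, if you want to remove every column that contains only zeros, you can use the following. Note that this drops any bit position that is zero in all rows, which can include positions in the middle of a number and not just leading zeros: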

>>> u[:,u.any(axis=0)]
array([[0, 1, 0, 0, 1, 1, 0, 1, 1],
       [1, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 0, 1]], dtype=uint8)

Edit

If you want to determine the maximum number of bits for each column separately:

>>> nbits = (np.floor(np.log2(np.max(a,axis=0)))+1).astype('int')
>>> nbits
array([3, 3, 4])

You can then filter the columns against this vector:

>>> u = np.unpackbits(a, axis=1)
>>> u[:,[i for i in range(u.shape[1]) if i%8>=8-nbits[i//8]]]
array([[0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 0, 0, 1]], dtype=uint8)
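
A loop-free variant of the per-column filter repeats each column's bit count eight times so that it lines up with the unpacked columns. This is a sketch with illustrative variable names; it computes the widths with `int.bit_length()` rather than `log2`, which is a swapped-in choice so that an all-zero column gives a width of 0 instead of `-inf`.

```python
import numpy as np

a = np.array([[2, 3, 5], [4, 6, 8], [3, 7, 9]], dtype=np.uint8)

# Per-column bit widths: [3, 3, 4] for this array.
nbits = np.array([int(v).bit_length() for v in a.max(axis=0)])

u = np.unpackbits(a, axis=1)
# Position of each unpacked column inside its byte (0 = most significant bit).
pos = np.arange(u.shape[1]) % 8
# Each input column owns 8 consecutive unpacked columns, so repeat its width 8 times.
keep = pos >= 8 - np.repeat(nbits, 8)
trimmed = u[:, keep]
```

Because the widths differ per column, a decoder needs `nbits` as well in order to split each row back into values.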

This should do the job!
