Looking for the fastest method to downsample a 3D array based on occurrences using numpy

Question

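Given a large 3D numpy array of type 'uint8' (not too large to fit in memory), I would like to scale this array down by a given scale factor in each dimension. You may assume that the shape of the array is divisible by the scale factor.

The values of the array are in [0, 1, ... max], where max is always smaller than 6. I would like to scale it down such that each 3D block of shape "scale_factor" returns the number that occurs most often in that block. In case of a tie, return the first one (I don't care which).

I have tried the following, which works: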
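To make the requirement concrete, here is a minimal sketch for a single block. The 2x2x2 block and its values are made up for illustration; only `np.bincount` and `np.argmax` are real numpy calls.

```python
import numpy as np

# Hypothetical 2x2x2 block with values in [0, 5]; the value 3 occurs most often.
block = np.array([[[0, 3], [3, 1]],
                  [[3, 2], [1, 3]]], dtype='uint8')

# Count occurrences of each value; minlength=6 covers every possible value 0..5.
counts = np.bincount(block.ravel(), minlength=6)

# argmax returns the first maximum, so a tie resolves to the smallest value.
most_common = int(np.argmax(counts))
print(counts, most_common)
```

The downsampled array would hold one such `most_common` value per block.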

import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4  # number of distinct values (0..3); a smaller minlength gives histograms of unequal length

# Reshape to free dimension of size scale_factor to apply scaledown method to
m, n, r = np.array(array.shape) // scale_factor
array = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))


# Making histogram, first over last axis, then sum over other two
array = np.apply_along_axis(lambda x: np.bincount(x, minlength=bincount),
                            axis=5, arr=array)
array = np.apply_along_axis(lambda x: np.sum(x), axis=3, arr=array)
array = np.apply_along_axis(lambda x: np.sum(x), axis=1, arr=array).astype('uint8')

array = np.argmax(array , axis=3)

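This works, but bincount is very slow. I also got np.histogram to work, but it is slow as well. I think neither method is really designed for my purpose; both offer extra functionality that slows them down.

My question is: does anyone know a faster method? I would also be happy if someone could point me to a method from a deep learning library that does this, but that is not officially the question.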

Answer 1

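Well, here is a similar method, but faster. It only replaces the bincount function with a simpler implementation for your use case: lambda x: max(set(x), key=lambda y: list(x).count(y)). The array is first reshaped so that this function can be applied directly along a single dimension.

On my laptop, for a 128x128x128 array, it is about 4 times faster: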
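As a small illustration of what that reducer does on one flattened block, here is a sketch with made-up data. The name `most_frequent` is introduced here for readability and is not part of the answer's code.

```python
import numpy as np

def most_frequent(x):
    # Same logic as the lambda: among the unique values of x,
    # pick the one whose count in x is largest.
    return max(set(x), key=lambda y: list(x).count(y))

# Hypothetical flattened block: the value 2 occurs four times.
flat_block = np.array([2, 0, 2, 1, 2, 1, 0, 2], dtype='uint8')
result = most_frequent(flat_block)
print(result)
```

Note that `list(x).count(y)` runs once per unique value in plain Python, so this is still per-block Python work inside `np.apply_along_axis`.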

import time
import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4

start_time = time.time()
N = 10
for i in range(N):

    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))

    # Obtain the element that occurred the most
    arr = np.apply_along_axis(lambda x: max(set(x), key=lambda y: list(x).count(y)),
                              axis=3, arr=arr)

print((time.time() - start_time) / N)

There is still a big gap compared with built-in methods such as np.mean().

Answer 2

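@F.Wessels is thinking in the right direction, but the answer is not quite there yet. The speed can be improved more than a hundredfold if you do the counting yourself instead of using apply_along_axis.

First, when you divide your 3D array into blocks, the dimensions change from 128x128x128 to 32x4x32x4x32x4. Try to understand this intuitively: you effectively have 32x32x32 blocks of size 4x4x4. Instead of keeping the blocks as 4x4x4, you can squash each one into a vector of size 64, from which the most frequent item can be found. Here is the trick: it does not matter whether your blocks are arranged as 32x32x32x64 or as 32768x64. Basically, we are back in two dimensions, where everything is easier.

Now, with the 2D array of size 32768x64, you can do the bin counting yourself with a list comprehension and numpy ops; it will be 10 times faster.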
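The block-to-row idea can be checked on a tiny array. This is a sketch under assumptions: the 4x4x4 input and 2x2x2 blocks are made up, and a single `transpose` is used in place of the two `swapaxes` calls in the code further down (the element order inside each row differs, but the set of elements per block is the same, which is all that counting needs).

```python
import numpy as np

# Small hypothetical example: a 4x4x4 array split into 2x2x2 blocks.
a = np.arange(64).reshape(4, 4, 4)
s = (2, 2, 2)
m, n, r = (dim // f for dim, f in zip(a.shape, s))

# Split each axis into (block index, offset inside the block).
blocks = a.reshape(m, s[0], n, s[1], r, s[2])
# Move the three block-index axes to the front and the offsets to the back.
blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
# One row per block, one column per element of that block.
rows = blocks.reshape(m * n * r, -1)

print(rows.shape)
# Row 0 holds exactly the elements of the block a[0:2, 0:2, 0:2].
print(np.array_equal(rows[0], a[0:2, 0:2, 0:2].ravel()))
```

With 128x128x128 input and a scale factor of (4, 4, 4), the same steps give the 32768x64 layout described above.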

import time
import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4

def prev_func(array):
    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))
    # Obtain the element that occurred the most
    arr = np.apply_along_axis(lambda x: max(set(x), key=lambda y: list(x).count(y)),
                              axis=3, arr=arr)
    return arr

def new_func(array):
    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))
    # Collapse dimensions
    arr = arr.reshape(-1,np.prod(scale_factor))
    # Get blockwise frequencies -> Get most frequent items
    arr = np.array([(arr==b).sum(axis=1) for b in range(bincount)]).argmax(axis=0)
    arr = arr.reshape((m,n,r))
    return arr

N = 10

start1 = time.time()
for i in range(N):
    out1 = prev_func(array)
end1 = time.time()
print('Prev:',(end1-start1)/N)

start2 = time.time()
for i in range(N):
    out2 = new_func(array)
end2 = time.time()
print('New:',(end2-start2)/N)

print('Difference:',(out1-out2).sum())

Output:

Prev: 1.4244404077529906
New: 0.01667332649230957
Difference: 0

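As you can see, there is no difference in the results, even though I juggled the dimensions around. Numpy's reshape preserved the order of the values when I went to 2D, because the last dimension of 64 was kept. That order is re-established when I reshape back to m, n, r. The original method you gave took about 5 seconds to run on my machine, so empirically the speedup is several hundredfold (about 5 s versus 0.017 s, roughly 300 times).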
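For reference, the same idea can be packed into one function and checked against a brute-force loop on a small array. This is a sketch, not the answer's exact code: the name `downsample_mode`, the `num_values` parameter, and the 8x8x8 test input are assumptions made for illustration, and a single `transpose` replaces the two `swapaxes` calls.

```python
import numpy as np

def downsample_mode(array, scale_factor, num_values):
    """Return the most frequent value of each scale_factor-sized block."""
    s0, s1, s2 = scale_factor
    m, n, r = array.shape[0] // s0, array.shape[1] // s1, array.shape[2] // s2
    # One row per block, one column per element of the block.
    rows = (array.reshape(m, s0, n, s1, r, s2)
                 .transpose(0, 2, 4, 1, 3, 5)
                 .reshape(m * n * r, s0 * s1 * s2))
    # counts[b, i] = number of times value b occurs in block i.
    counts = np.stack([(rows == b).sum(axis=1) for b in range(num_values)])
    # argmax picks the first maximum, so ties go to the smallest value.
    return counts.argmax(axis=0).reshape(m, n, r).astype(array.dtype)

rng = np.random.default_rng(0)
small = rng.integers(0, 4, size=(8, 8, 8), dtype=np.uint8)
out = downsample_mode(small, (4, 4, 4), num_values=4)

# Brute-force reference: histogram each 4x4x4 block separately.
ref = np.empty((2, 2, 2), dtype=np.uint8)
for i in range(2):
    for j in range(2):
        for k in range(2):
            block = small[4*i:4*i+4, 4*j:4*j+4, 4*k:4*k+4]
            ref[i, j, k] = np.bincount(block.ravel(), minlength=4).argmax()

print(np.array_equal(out, ref))
```

Comparing with `np.array_equal` is a stricter check than summing the differences, since positive and negative differences could cancel in a sum.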
