Fastest way to get max frequency element for every row of numpy matrix

Question

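Given a 2D numpy matrix X of shape [m, n], whose values are all guaranteed to be integers between 0 and 9 inclusive, I want to compute, for each row, the value that occurs most frequently in that row (breaking ties by returning the largest value), and output the resulting array of length m. A short example follows: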

X = [[1,2,3,4],
     [0,0,6,9],
     [5,7,7,5],
     [1,0,0,0],
     [1,8,1,8]]

The output for the matrix above should be:

y = [4,0,7,0,8]

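Consider the first row: all elements occur with the same frequency, so the largest of them, 4, is chosen. In the second row, only the number 0 has the highest frequency. In the third row, both 5 and 7 occur twice, so 7 is chosen, and so on.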

I can do this by building a collections.Counter object for each row and then picking the number that satisfies the condition. A naive implementation I tried:

from collections import Counter

import numpy as np

X = np.array([[1,2,3,4],[0,0,6,9],[5,7,7,5],[1,0,0,0],[1,8,1,8]])
y = np.zeros(len(X), dtype=int)

for i in range(len(X)):
    freq_count = Counter(X[i])
    max_freq, max_freq_val = 0, -1
    # Scan candidate values in ascending order; ">=" lets a larger value win ties.
    for val in range(10):
        if freq_count.get(val, 0) >= max_freq:
            max_freq = freq_count.get(val, 0)
            max_freq_val = val
    y[i] = max_freq_val

print(y)  # prints [4 0 7 0 8]

But using Counter is not fast enough. Is it possible to improve the running time, perhaps by using vectorization? Assume m = O(5e4) and n = 45.

Answer 1

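Given that the numbers are always integers between 0 and 9, you can use numpy.bincount to count the number of occurrences, then use numpy.argmax on the reversed view ([::-1]) of the counts to find the last occurrence of the maximum count: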

import numpy as np

X = np.array([[1, 2, 3, 4],
              [0, 0, 6, 9],
              [5, 7, 7, 5],
              [1, 0, 0, 0],
              [1, 8, 1, 8]])

res = [9 - np.bincount(row, minlength=10)[::-1].argmax() for row in X]
print(res)

Output

[4, 0, 7, 0, 8]
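
To see why the reversed argmax picks the largest value among ties, it can help to look at the intermediate arrays for a single row. The snippet below is only an illustrative sketch using the third row of the example; the variable names are made up for this walkthrough and are not part of the original answer.

```python
import numpy as np

row = np.array([5, 7, 7, 5])

# counts[v] is the number of times the value v occurs in the row.
counts = np.bincount(row, minlength=10)

# In the reversed view, index k corresponds to the value 9 - k.
reversed_counts = counts[::-1]

# argmax returns the FIRST index of the maximum, which in the reversed
# view is the LARGEST value that reaches the maximum count.
first_max_in_reversed = reversed_counts.argmax()

# Map the reversed index back to the original value.
result = 9 - first_max_in_reversed
print(counts, reversed_counts, result)
```

Here 5 and 7 both occur twice, and because 7 comes first in the reversed view, it is the value returned.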

According to the linked timings, np.bincount is pretty fast. For more details on using argmax to find the last occurrence of the max value, read this
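
Since the question also asks about vectorization, one possible way to drop the Python-level loop over rows is to shift each row's values into its own block of 10 bins and call np.bincount once on the flattened array. This is a sketch under the stated assumption that every entry is an integer in [0, 9]; the function name row_modes_max_tiebreak and the n_values parameter are hypothetical and not from the original answer, and no timing claim is made for it here.

```python
import numpy as np

def row_modes_max_tiebreak(X, n_values=10):
    """Most frequent value per row; ties are resolved toward the larger value.

    Assumes X is a 2D array of integers in [0, n_values - 1].
    """
    X = np.asarray(X)
    m = X.shape[0]
    # Shift row i into the range [i * n_values, (i + 1) * n_values) so that
    # a single bincount call counts every row separately.
    offsets = np.arange(m)[:, None] * n_values
    counts = np.bincount((X + offsets).ravel(), minlength=m * n_values)
    counts = counts.reshape(m, n_values)
    # argmax returns the first maximum; reversing the columns makes the
    # largest value win ties, as in the per-row version.
    return (n_values - 1) - counts[:, ::-1].argmax(axis=1)

X = np.array([[1, 2, 3, 4],
              [0, 0, 6, 9],
              [5, 7, 7, 5],
              [1, 0, 0, 0],
              [1, 8, 1, 8]])

res = row_modes_max_tiebreak(X)
print(res)
```

For the example matrix this should produce the same values as the list comprehension, returned as a numpy array rather than a list. The intermediate counts array has shape [m, 10], so memory use stays small for m around 5e4.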
