NumPy: Optimal way to count index occurrences in an array
I have an array indexs. It is long (>10k elements), and each int value is fairly small (<100). For example:
indexs = np.array([1, 4, 3, 0, 0, 1, 2, 0]) # int index array
indexs_max = 4 # already known
Now I want to count the occurrences of each index value (e.g. 0 occurs 3 times, 1 occurs 2 times, ...) and get counts as np.array([3, 2, 1, 1, 1]). I tested the following 4 methods:
UPDATE: _test4 is @'s solution:
import numpy as np

indexs = np.random.randint(0, 10, (20000,))
indexs_max = 9

def _test1():
    # Plain Python loop over every element
    counts = np.zeros((indexs_max + 1,), dtype=np.int32)
    for ind in indexs:
        counts[ind] += 1
    return counts

def _test2():
    counts = np.zeros((indexs_max + 1,), dtype=np.int32)
    uniq_vals, uniq_cnts = np.unique(indexs, return_counts=True)
    # Assign through uniq_vals because some values in the range may be missing
    counts[uniq_vals] = uniq_cnts
    return counts

def _test3():
    # Compare every element against every candidate value, then sum each row
    therange = np.arange(0, indexs_max + 1)
    counts = np.sum(indexs[None] == therange[:, None], axis=1)
    return counts

def _test4():
    return np.bincount(indexs, minlength=indexs_max + 1)
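Over 500 runs, they took 32.499472856521606 s, 0.31386804580688477 s, 0.14069509506225586 s and 0.017721891403198242 s respectively. Of my original three methods, _test3 is the fastest, but it uses a lot of extra memory, since it builds an (indexs_max + 1) x len(indexs) boolean array before summing.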
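For reference, a comparison along these lines could be set up with the standard-library timeit module. This is only a sketch: the names count_loop, count_unique, count_broadcast and count_bincount are hypothetical stand-ins for _test1 to _test4, the run count is reduced to keep the slow loop bearable, and absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
indexs = rng.integers(0, 10, size=20000)
indexs_max = 9


def count_loop():
    counts = np.zeros(indexs_max + 1, dtype=np.int64)
    for ind in indexs:
        counts[ind] += 1
    return counts


def count_unique():
    counts = np.zeros(indexs_max + 1, dtype=np.int64)
    uniq_vals, uniq_cnts = np.unique(indexs, return_counts=True)
    counts[uniq_vals] = uniq_cnts
    return counts


def count_broadcast():
    therange = np.arange(indexs_max + 1)
    # Builds an (indexs_max + 1, len(indexs)) boolean array before summing.
    return np.sum(indexs[None] == therange[:, None], axis=1)


def count_bincount():
    return np.bincount(indexs, minlength=indexs_max + 1)


candidates = [count_loop, count_unique, count_broadcast, count_bincount]

# All approaches should agree before their speed is compared.
reference = count_bincount()
for func in candidates:
    assert np.array_equal(func(), reference)

# Assumption: 10 runs per candidate instead of the 500 used in the question.
runs = 10
timings = {func.__name__: timeit.timeit(func, number=runs) for func in candidates}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for {runs} runs")
```

Checking that every candidate returns the same counts before timing guards against comparing functions that do not compute the same thing.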
So I am asking whether there is a better method. Thank you :) (@)
UPDATE: np.bincount appears to be the best option so far.
You can use np.bincount to count the occurrences of each value in an array.
indexs = np.array([1, 4, 3, 0, 0, 1, 2, 0])
np.bincount(indexs)
# array([3, 2, 1, 1, 1])
# 0's 1's 2's 3's 4's count
One caveat: np.bincount(x).size == np.amax(x) + 1
Example:
indexs = np.array([5, 10])
np.bincount(indexs)
# array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1])
# 5's 10's count
Here it counts the occurrences of every value from 0 up to the maximum in the array, so values that never occur show up as zeros. A workaround is to drop the zero counts:
c = np.bincount(indexs) # indexs is [5, 10]
c = c[c>0]
# array([1, 1])
# 5's 10's count
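Note that filtering with c[c > 0] discards the information about which value each count belongs to. One possible way to keep it, sketched here with np.nonzero (the variable names values and counts are illustrative, not from the answer), is:

```python
import numpy as np

indexs = np.array([5, 10])
c = np.bincount(indexs)

# Positions of the non-zero bins are exactly the values that occur.
values = np.nonzero(c)[0]
counts = c[values]

print(values)  # [ 5 10]
print(counts)  # [1 1]
```

np.unique(indexs, return_counts=True), as used in _test2 of the question, returns the same pair directly, at the cost of sorting the input.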
If you have no missing values, i.e. every value from 0 to your_max occurs at least once, you can use np.bincount directly.
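When the maximum is known in advance but not guaranteed to occur in the data, the minlength argument (as used in _test4 in the question) pads the result to a fixed length. A small sketch, with your_max standing in for a hypothetical known upper bound:

```python
import numpy as np

your_max = 4
indexs = np.array([1, 0, 0, 1, 2, 0])  # 3 and 4 never occur

# Without minlength the result would stop at the largest value present (2).
counts = np.bincount(indexs, minlength=your_max + 1)
print(counts)  # [3 2 1 0 0]
```

minlength is only a lower bound: if the data contains a value larger than your_max, the result is still extended to cover it.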
Another caveat, from the documentation:
Count the number of occurrences of each value in an array of non-negative ints.
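In practice this means np.bincount raises a ValueError when the input contains negative values. If negative integers are expected, one possible workaround (not part of the original answer) is to shift the data by its minimum before counting. The names data, offset and values below are illustrative:

```python
import numpy as np

data = np.array([-2, 0, -2, 1])

try:
    np.bincount(data)
except ValueError as exc:
    print("bincount rejected the input:", exc)

# Shift so the smallest value maps to bin 0, then count.
offset = data.min()
counts = np.bincount(data - offset)
# Recover the original value that each bin stands for.
values = np.arange(offset, offset + counts.size)

print(values)  # [-2 -1  0  1]
print(counts)  # [2 0 1 1]
```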