加快在 numpy 数组中查找索引

Speed up indices finding in numpy array

我有一个名为 ops 的 1d numpy 字符串数组 (dtype='U'),长度为 15MM,我需要在其中找到所有索引,其中我找到了一个名为 op 的字符串 83,000 次。

到目前为止 numpy 赢得了比赛,但仍然需要大约 3 个小时:indices = np.where(ops==op) 我也试过 np.unravel_index(np.where(ops.ravel()==op), ops.shape)[0][0] 没有太大区别。

我正在尝试使用与原始数据相似的随机数据的 cython 方法,但它比 numpys 解决方案慢大约 40 倍。这是我的第一个 cython 代码,也许我可以改进它。 赛通代码:

import numpy as np
cimport numpy as np

def get_ixs(np.ndarray data, str x, np.ndarray[int,mode="c",ndim=1] xind):
    cdef int count, n, i
    count = 0
    n = data.shape[0]
    i = 0
    while i < n:
        if (data[i] == x):
            xind[count] = i
            count += 1
        i += 1

    return xind[0:count]

如果您使用相同的 data 多次调用 get_ixs,最快的解决方案是将 data 预处理为 dict,然后进行 O(1) 次查找(常数时间)查询字符串时。
字典的一个键是一个字符串 x,这个键的值是一个包含满足 data[i] == x.
的索引的列表 这是代码:

import numpy as np

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

indices = np.arange(len(data))
# sort data so that we can construct the dict by replacing list with ndarray as soon as possible (when string changes) to reduce memory usage
indices_data_sorted = np.argsort(data)  
data = data[indices_data_sorted]
indices = indices[indices_data_sorted]

# construct the dict str -> ndarray of indices (use ndarray for lower memory consumption)
dict_str_to_indices = dict()
prev_str = None
list_idx = []  # list to hold the indices for a given string
for i, s in zip(indices, data):
    if s != prev_str:  
        # the current string has changed so we can construct the ndarray and store it in the dict
        if prev_str is not None:
            dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")
        list_idx.clear()
        prev_str = s
    list_idx.append(i)
    
dict_str_to_indices[s] = np.array(list_idx, dtype="int32")  # add the ndarray for last string

def get_ixs(dict_str_to_indices: dict, x: str):
    return dict_str_to_indices[x]

print(get_ixs(dict_str_to_indices, "toto"))
print(get_ixs(dict_str_to_indices, "titi"))
print(get_ixs(dict_str_to_indices, "tutu"))

输出:

[0 2]
[1 3]
[4]

如果用同一个dict_str_to_indices多次调用get_ixs,就是最优渐近解(O(1)查找)。