加快在 numpy 数组中查找索引
Speed up indices finding in numpy array
我有一个名为 ops
的 1d numpy 字符串数组 (dtype='U')
,长度为 15MM,我需要在其中找到所有索引,其中我找到了一个名为 op
的字符串 83,000 次。
到目前为止 numpy 赢得了比赛,但仍然需要大约 3 个小时:indices = np.where(ops==op)
我也试过 np.unravel_index(np.where(ops.ravel()==op), ops.shape)[0][0]
没有太大区别。
我正在尝试使用与原始数据相似的随机数据的 cython 方法,但它比 numpys 解决方案慢大约 40 倍。这是我的第一个 cython 代码,也许我可以改进它。
赛通代码:
import numpy as np
cimport numpy as np
def get_ixs(np.ndarray data, str x, np.ndarray[int,mode="c",ndim=1] xind):
cdef int count, n, i
count = 0
n = data.shape[0]
i = 0
while i < n:
if (data[i] == x):
xind[count] = i
count += 1
i += 1
return xind[0:count]
如果您使用相同的 data
多次调用 get_ixs
,最快的解决方案是将 data
预处理为 dict
,然后进行 O(1) 次查找(常数时间)查询字符串时。
字典的一个键是一个字符串 x
,这个键的值是一个包含满足 data[i] == x
.
的索引的列表
这是代码:
import numpy as np
data = np.array(["toto", "titi", "toto", "titi", "tutu"])
indices = np.arange(len(data))
# sort data so that we can construct the dict by replacing list with ndarray as soon as possible (when string changes) to reduce memory usage
indices_data_sorted = np.argsort(data)
data = data[indices_data_sorted]
indices = indices[indices_data_sorted]
# construct the dict str -> ndarray of indices (use ndarray for lower memory consumption)
dict_str_to_indices = dict()
prev_str = None
list_idx = [] # list to hold the indices for a given string
for i, s in zip(indices, data):
if s != prev_str:
# the current string has changed so we can construct the ndarray and store it in the dict
if prev_str is not None:
dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")
list_idx.clear()
prev_str = s
list_idx.append(i)
dict_str_to_indices[s] = np.array(list_idx, dtype="int32") # add the ndarray for last string
def get_ixs(dict_str_to_indices: dict, x: str):
return dict_str_to_indices[x]
print(get_ixs(dict_str_to_indices, "toto"))
print(get_ixs(dict_str_to_indices, "titi"))
print(get_ixs(dict_str_to_indices, "tutu"))
输出:
[0 2]
[1 3]
[4]
如果用同一个dict_str_to_indices
多次调用get_ixs
,就是最优渐近解(O(1)查找)。
我有一个名为 ops
的 1d numpy 字符串数组 (dtype='U')
,长度为 15MM,我需要在其中找到所有索引,其中我找到了一个名为 op
的字符串 83,000 次。
到目前为止 numpy 赢得了比赛,但仍然需要大约 3 个小时:indices = np.where(ops==op)
我也试过 np.unravel_index(np.where(ops.ravel()==op), ops.shape)[0][0]
没有太大区别。
我正在尝试使用与原始数据相似的随机数据的 cython 方法,但它比 numpys 解决方案慢大约 40 倍。这是我的第一个 cython 代码,也许我可以改进它。 赛通代码:
import numpy as np
cimport numpy as np
def get_ixs(np.ndarray data, str x, np.ndarray[int,mode="c",ndim=1] xind):
cdef int count, n, i
count = 0
n = data.shape[0]
i = 0
while i < n:
if (data[i] == x):
xind[count] = i
count += 1
i += 1
return xind[0:count]
如果您使用相同的 data
多次调用 get_ixs
,最快的解决方案是将 data
预处理为 dict
,然后进行 O(1) 次查找(常数时间)查询字符串时。
字典的一个键是一个字符串 x
,这个键的值是一个包含满足 data[i] == x
.
的索引的列表
这是代码:
import numpy as np
data = np.array(["toto", "titi", "toto", "titi", "tutu"])
indices = np.arange(len(data))
# sort data so that we can construct the dict by replacing list with ndarray as soon as possible (when string changes) to reduce memory usage
indices_data_sorted = np.argsort(data)
data = data[indices_data_sorted]
indices = indices[indices_data_sorted]
# construct the dict str -> ndarray of indices (use ndarray for lower memory consumption)
dict_str_to_indices = dict()
prev_str = None
list_idx = [] # list to hold the indices for a given string
for i, s in zip(indices, data):
if s != prev_str:
# the current string has changed so we can construct the ndarray and store it in the dict
if prev_str is not None:
dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")
list_idx.clear()
prev_str = s
list_idx.append(i)
dict_str_to_indices[s] = np.array(list_idx, dtype="int32") # add the ndarray for last string
def get_ixs(dict_str_to_indices: dict, x: str):
return dict_str_to_indices[x]
print(get_ixs(dict_str_to_indices, "toto"))
print(get_ixs(dict_str_to_indices, "titi"))
print(get_ixs(dict_str_to_indices, "tutu"))
输出:
[0 2]
[1 3]
[4]
如果用同一个dict_str_to_indices
多次调用get_ixs
,就是最优渐近解(O(1)查找)。