在大型 numpy 数组上使用布尔掩码非常慢
Using a boolean Mask on large numpy array is very slow
我在使用 python 编码时遇到性能问题。
假设我有 2 个非常大的字符串数组 (Nx2),比如 N = 12,000,000,还有两个变量 label_a 和 label_b,它们也是字符串。这是以下代码:
import numpy as np
import time
indices = np.array([np.random.choice(np.arange(5000).astype(str),size=10000000),np.random.choice(np.arange(5000).astype(str),size=10000000)]).T
costs = np.random.uniform(size=10000000)
label_a = '2'
label_b = '9'
t0 = time.time()
costs = costs[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
indices = indices[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
t1 = time.time()
toseq = t1-t0
print(toseq)
以上代码段每次运行需要3秒。我想在降低计算成本的同时实现同样的目标:
我正在使用布尔掩码仅检索值不是 label_a 和 label_b
的 costs 和 indices 数组中的行
如评论中所述,只需计算一次您所追求的索引的值,并将它们组合一次即可节省时间。
(我也改变了计时方式,只是为了简洁 - 结果是一样的)
import numpy as np
from timeit import timeit
r = 5000
n = 10000000
indices = np.array([
np.random.choice(np.arange(r).astype(str), size=n),
np.random.choice(np.arange(r).astype(str), size=n)
]).T
costs = np.random.uniform(size=n)
label_a = '2'
label_b = '9'
n_indices = np.array([
np.random.choice(np.arange(r), size=n),
np.random.choice(np.arange(r), size=n)
]).T
def run():
global indices
global costs
_ = costs[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
(indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
_ = indices[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
(indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
def run_faster():
global indices
global costs
# only compute these only once
not_a0 = indices[:, 0] != label_a
not_b0 = indices[:, 0] != label_b
not_a1 = indices[:, 1] != label_a
not_b1 = indices[:, 1] != label_b
_ = costs[not_a0 * not_b0 * not_a1 * not_b1]
_ = indices[not_a0 * not_b0 * not_a1 * not_b1]
def run_even_faster():
global indices
global costs
# also combine them only once
cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
(indices[:, 1] != label_a) * (indices[:, 1] != label_b))
_ = costs[cond]
_ = indices[cond]
def run_sep_mask():
global indices
global costs
global cond
# just the masking part of run_even_faster
cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
(indices[:, 1] != label_a) * (indices[:, 1] != label_b))
def run_sep_index():
global indices
global costs
global cond
# just the indexing part of run_even_faster
_ = costs[cond]
_ = indices[cond]
def run_even_faster_numerical():
global indices
global costs
# use int values and n_indices instead of indices
a = int(label_a)
b = int(label_b)
cond = ((n_indices[:, 0] != a) * (n_indices[:, 0] != b) *
(n_indices[:, 1] != a) * (n_indices[:, 1] != b))
_ = costs[cond]
_ = indices[cond]
def run_all(funcs):
for f in funcs:
print('{:.4f} : {}()'.format(timeit(f, number=1), f.__name__))
run_all([run, run_faster, run_even_faster, run_sep_mask, run_sep_index, run_even_faster_numerical])
请注意,我还添加了一个示例,其中操作不是基于字符串,而是基于数字。如果您可以避免值是字符串,而是获取数字,您也会获得性能提升。
如果您开始比较更长的标签,这种提升会变得很明显 - 最后,如果字符串足够长,甚至可能值得在过滤之前将字符串转换为数字。
这些是我的结果:
0.9711 : run()
0.7065 : run_faster()
0.6983 : run_even_faster()
0.2657 : run_sep_mask()
0.4174 : run_sep_index()
0.4536 : run_even_faster_numerical()
两个 sep
条目显示索引大约是为 run_even_faster
构建掩码所需时间的两倍.
但是,他们还表明,在进行实际索引之后,基于整数构建掩码的时间不到 0.04 秒,而基于字符串构建掩码的时间约为 0.26 秒。所以,这就是你需要改进的地方。
我在使用 python 编码时遇到性能问题。 假设我有 2 个非常大的字符串数组 (Nx2),比如 N = 12,000,000,还有两个变量 label_a 和 label_b,它们也是字符串。这是以下代码:
import numpy as np
import time
indices = np.array([np.random.choice(np.arange(5000).astype(str),size=10000000),np.random.choice(np.arange(5000).astype(str),size=10000000)]).T
costs = np.random.uniform(size=10000000)
label_a = '2'
label_b = '9'
t0 = time.time()
costs = costs[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
indices = indices[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
t1 = time.time()
toseq = t1-t0
print(toseq)
以上代码段每次运行需要3秒。我想在降低计算成本的同时实现同样的目标: 我正在使用布尔掩码仅检索值不是 label_a 和 label_b
的 costs 和 indices 数组中的行如评论中所述,只需计算一次您所追求的索引的值,并将它们组合一次即可节省时间。
(我也改变了计时方式,只是为了简洁 - 结果是一样的)
import numpy as np
from timeit import timeit
r = 5000
n = 10000000
indices = np.array([
np.random.choice(np.arange(r).astype(str), size=n),
np.random.choice(np.arange(r).astype(str), size=n)
]).T
costs = np.random.uniform(size=n)
label_a = '2'
label_b = '9'
n_indices = np.array([
np.random.choice(np.arange(r), size=n),
np.random.choice(np.arange(r), size=n)
]).T
def run():
global indices
global costs
_ = costs[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
(indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
_ = indices[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
(indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
def run_faster():
global indices
global costs
# only compute these only once
not_a0 = indices[:, 0] != label_a
not_b0 = indices[:, 0] != label_b
not_a1 = indices[:, 1] != label_a
not_b1 = indices[:, 1] != label_b
_ = costs[not_a0 * not_b0 * not_a1 * not_b1]
_ = indices[not_a0 * not_b0 * not_a1 * not_b1]
def run_even_faster():
global indices
global costs
# also combine them only once
cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
(indices[:, 1] != label_a) * (indices[:, 1] != label_b))
_ = costs[cond]
_ = indices[cond]
def run_sep_mask():
global indices
global costs
global cond
# just the masking part of run_even_faster
cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
(indices[:, 1] != label_a) * (indices[:, 1] != label_b))
def run_sep_index():
global indices
global costs
global cond
# just the indexing part of run_even_faster
_ = costs[cond]
_ = indices[cond]
def run_even_faster_numerical():
global indices
global costs
# use int values and n_indices instead of indices
a = int(label_a)
b = int(label_b)
cond = ((n_indices[:, 0] != a) * (n_indices[:, 0] != b) *
(n_indices[:, 1] != a) * (n_indices[:, 1] != b))
_ = costs[cond]
_ = indices[cond]
def run_all(funcs):
for f in funcs:
print('{:.4f} : {}()'.format(timeit(f, number=1), f.__name__))
run_all([run, run_faster, run_even_faster, run_sep_mask, run_sep_index, run_even_faster_numerical])
请注意,我还添加了一个示例,其中操作不是基于字符串,而是基于数字。如果您可以避免值是字符串,而是获取数字,您也会获得性能提升。
如果您开始比较更长的标签,这种提升会变得很明显 - 最后,如果字符串足够长,甚至可能值得在过滤之前将字符串转换为数字。
这些是我的结果:
0.9711 : run()
0.7065 : run_faster()
0.6983 : run_even_faster()
0.2657 : run_sep_mask()
0.4174 : run_sep_index()
0.4536 : run_even_faster_numerical()
两个 sep
条目显示索引大约是为 run_even_faster
构建掩码所需时间的两倍.
但是,他们还表明,在进行实际索引之后,基于整数构建掩码的时间不到 0.04 秒,而基于字符串构建掩码的时间约为 0.26 秒。所以,这就是你需要改进的地方。