Generating a mask array for the existence of each element of a list in another list

Is there a faster way to do this, or one that is more efficient in terms of runtime and memory?

>>> list1 = ['a', 'b', 'c', 'd']
>>> list2 = ['b', 'c']
>>> mask_array = [True if x in list2 else False for x in list1]
>>> mask_array
[False, True, True, False]

list1 = ['a', 'b', 'c']
list2 = ['b', 'c']

set2 = set(list2)
mask_array = [x in set2 for x in list1]

A lookup in a set costs O(1) on average, far less than the O(n) cost of a lookup in a list.
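The set-based idea can be wrapped in a small helper. This is only a minimal sketch: the name `membership_mask` is made up for illustration, and it assumes every element is hashable (strings, numbers, tuples), since unhashable items such as lists cannot be put in a set.

```python
from typing import Hashable, Iterable, List


def membership_mask(items: Iterable[Hashable], lookup: Iterable[Hashable]) -> List[bool]:
    """Return one boolean per element of `items`: True if it occurs in `lookup`."""
    # Build the set once, outside the comprehension, so each test is O(1) on average.
    lookup_set = set(lookup)
    return [item in lookup_set for item in items]


mask = membership_mask(['a', 'b', 'c', 'd'], ['b', 'c'])
print(mask)  # [False, True, True, False]
```

Building the set costs O(len(lookup)) time and extra memory, so the gain comes from paying that once instead of scanning the list for every element.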

Here you can see the difference, and it is huge:

from time import time
import random
import numpy as np

random.seed(7)

list1 = [random.randrange(1000000) for i in range(100000)]
list2 = [random.randrange(1000000) for i in range(100000)]

start = time()
mask_array = [True if x in list2 else False for x in list1]
stop = time()
print(stop - start) # 93.71739292144775

start = time()
set2 = set(list2)
mask_array = [True if x in set2 else False for x in list1]
stop = time()
print(stop - start) # 0.022114992141723633

start = time()
mask_array = np.isin(list1, list2)
stop = time()
print(stop - start) # 0.03964031219482422

Roughly 90 seconds vs. less than 1 second!

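In this case, you can see that my solution is even faster than the np.isin solution. Note that list1 and list2 are plain Python lists here, so the np.isin timing also includes converting them to arrays.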
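A single time() measurement can be noisy, so the list-vs-set comparison could also be repeated with the standard timeit module. The sketch below is an assumption-laden variant of the benchmark above, not a reproduction of it: the input sizes are deliberately much smaller so the list version finishes quickly, and the helper names `mask_with_list` and `mask_with_set` are made up for illustration. Absolute numbers will differ from machine to machine.

```python
import random
import timeit

random.seed(7)

# Smaller inputs than the benchmark above so the list version finishes quickly.
list1 = [random.randrange(100_000) for _ in range(2_000)]
list2 = [random.randrange(100_000) for _ in range(2_000)]


def mask_with_list():
    # Each `in` scans list2 from the start: O(len(list2)) per element.
    return [x in list2 for x in list1]


def mask_with_set():
    # Building the set is included in the timing, as in the benchmark above.
    set2 = set(list2)
    return [x in set2 for x in list1]


# Both approaches must produce the same mask.
assert mask_with_list() == mask_with_set()

# min() over several repeats is less sensitive to background noise.
t_list = min(timeit.repeat(mask_with_list, number=1, repeat=3))
t_set = min(timeit.repeat(mask_with_set, number=1, repeat=3))
print(f"list: {t_list:.4f} s, set: {t_set:.4f} s")
```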

Since you tagged numpy, you can use np.isin to get the mask, which is the more performant approach:

>>> import numpy as np
>>> list1 = ['a', 'b', 'c', 'd']
>>> list2 = ['b', 'c']

>>> np.isin(list1, list2)
array([False,  True,  True, False])
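Because np.isin returns a boolean NumPy array, the mask can be used directly for indexing. The following is a small sketch of that, assuming the data is held in an array named `arr` (a name chosen here for illustration); the `invert` keyword of np.isin gives the complementary mask.

```python
import numpy as np

arr = np.array(['a', 'b', 'c', 'd'])
lookup = ['b', 'c']

mask = np.isin(arr, lookup)
print(mask)        # [False  True  True False]
print(arr[mask])   # ['b' 'c']

# invert=True marks the elements that are NOT in lookup.
not_in = np.isin(arr, lookup, invert=True)
print(arr[not_in])  # ['a' 'd']
```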

Timings:

a = np.random.randint(0,200_000, 100_000)
b = np.random.randint(0,10_000, 10_000)

%timeit np.isin(a,b)
# 8.78 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
set2 = set(b)
mask_array = [x in set2 for x in a]
# 15.9 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)