Generating a mask array for the existence of each element of a list in another list
Do you think there is a faster way, or one that is more efficient in terms of runtime and memory?
>>> list1 = ['a', 'b', 'c', 'd']
>>> list2 = ['b', 'c']
>>> mask_array = [True if x in list2 else False for x in list1]
>>> mask_array
[False, True, True, False]
list1 = ['a', 'b', 'c']
list2 = ['b', 'c']
set2 = set(list2)
mask_array = [x in set2 for x in list1]
A lookup in a set costs O(1) on average, far less than a lookup in a list, which is O(n).
Here you can see the difference, and it is huge:
from time import time
import random
import numpy as np
random.seed(7)
list1 = [random.randrange(1000000) for i in range(100000)]
list2 = [random.randrange(1000000) for i in range(100000)]
start = time()
mask_array = [True if x in list2 else False for x in list1]
stop = time()
print(stop - start) # 93.71739292144775
start = time()
set2 = set(list2)
mask_array = [True if x in set2 else False for x in list1]
stop = time()
print(stop - start) # 0.022114992141723633
start = time()
mask_array = np.isin(list1, list2)
stop = time()
print(stop - start) # 0.03964031219482422
More than 90 seconds vs. less than 1 second!
In this case, you can see that my solution is even faster than the np.isin solution.
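For anyone who wants to reproduce the comparison without waiting a minute and a half, here is a minimal sketch using `timeit` on much smaller inputs. The sizes, the seed, and the helper names `mask_with_list` and `mask_with_set` are my own choices for illustration, not part of the benchmark above, and the absolute numbers will differ by machine.

```python
import random
import timeit

random.seed(7)
# Smaller inputs than the benchmark above, so the list-based version finishes quickly.
list1 = [random.randrange(10_000) for _ in range(2_000)]
list2 = [random.randrange(10_000) for _ in range(2_000)]


def mask_with_list():
    # Each membership test scans list2: O(len(list2)) per element of list1.
    return [x in list2 for x in list1]


def mask_with_set():
    # Building the set costs O(len(list2)) once; each lookup is O(1) on average.
    set2 = set(list2)
    return [x in set2 for x in list1]


# Both approaches must produce exactly the same mask.
assert mask_with_list() == mask_with_set()

t_list = timeit.timeit(mask_with_list, number=5)
t_set = timeit.timeit(mask_with_set, number=5)
print(f"list lookup: {t_list:.4f} s, set lookup: {t_set:.4f} s")
```

Note that the set version includes the cost of building the set, so the comparison stays fair.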
Since you tagged numpy, you can use np.isin to get the mask, which is a more performant approach:
>>> import numpy as np
>>> list1 = ['a', 'b', 'c', 'd']
>>> list2 = ['b', 'c']
>>> np.isin(list1, list2)
array([False,  True,  True, False])
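In case it helps, here is a small sketch of what you can do with the resulting mask, assuming you are happy to convert `list1` to an array first. The names `arr1`, `present`, and `absent` are just illustrative; `invert=True` is a documented parameter of `np.isin` that returns the complementary mask.

```python
import numpy as np

list1 = ['a', 'b', 'c', 'd']
list2 = ['b', 'c']

arr1 = np.array(list1)
mask = np.isin(arr1, list2)

# Boolean indexing keeps the elements where the mask is True.
present = arr1[mask]
# invert=True gives the mask of elements that are NOT in list2.
absent = arr1[np.isin(arr1, list2, invert=True)]

print(present.tolist())  # ['b', 'c']
print(absent.tolist())   # ['a', 'd']
```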
Timings:
a = np.random.randint(0,200_000, 100_000)
b = np.random.randint(0,10_000, 10_000)
%timeit np.isin(a,b)
# 8.78 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
set2 = set(b)
mask_array = [x in set2 for x in a]
# 15.9 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
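One thing to keep in mind when comparing the two: np.isin returns a boolean ndarray, while the comprehension returns a plain Python list. The sketch below checks that both give the same mask on data shaped like the timing inputs. The `default_rng(0)` seed is my own assumption, added only for reproducibility. I have also used `.tolist()` so the set holds plain Python ints, which may make the comprehension a bit faster than iterating over NumPy scalars, though I have not timed that here.

```python
import numpy as np

# Seeded generator; the seed value is arbitrary and only for reproducibility.
rng = np.random.default_rng(0)
a = rng.integers(0, 200_000, 100_000)
b = rng.integers(0, 10_000, 10_000)

# NumPy approach: returns an ndarray of dtype bool.
mask_np = np.isin(a, b)

# Set approach: convert to plain Python ints first, then test membership.
set2 = set(b.tolist())
mask_list = [x in set2 for x in a.tolist()]

# Same mask, different container types.
assert mask_np.dtype == bool
assert isinstance(mask_list, list)
assert np.array_equal(mask_np, np.array(mask_list))
```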