给定两个列表的匹配元素的索引,其中一个列表具有冗余条目
Indices of matching elements given two lists of which one has redundant entries
我有两个列表,a
和 b
。 a
包含我想知道 b
中匹配元素索引的元素。在 b
中,每个元素都是唯一的,这与 a
中不同。
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
使用来自Finding the indices of matching elements in list in Python的解决方案:
matching = [match for match, element in enumerate(b) if element in a]
matching
然而只是 [27, 28, 29, 30, 32, 37, 39]
,但我希望它是 [27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
.
>>> a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
>>> b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
>>> [b.index(x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
那
呢
print [b.index(i) for i in a if i in b]
如果你的列表很大,将 b 设为一个集合会更有效率:
st = set(b)
print([b.index(x) for x in a if x in st])
由于您的数据已排序并假设来自 a 的所有元素都在 b 中,您还可以使用 bisect 因此每个索引查找都是 O(log n):
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
from bisect import bisect_left
print [bisect_left(b, x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
在小型数据集上,它 运行 的速度是索引的两倍:
In [22]: timeit [bisect_left(b, x) for x in a]
100000 loops, best of 3: 4.2 µs per loop
In [23]: timeit [b.index(x) for x in a]
100000 loops, best of 3: 8.84 µs per loop
另一种选择是使用字典来存储索引,这意味着代码将 运行 在线性时间内,一次通过 a 一次通过 b:
# store all indexes as values and years as keys
indexes = {k: i for i, k in enumerate(b)}
# one pass over a accessing each index in constant time
print [indexes[x] for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
即使在小输入集上也比索引更有效,并且随着 a 的增长会更有效:
In [34]: %%timeit
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
....:
100000 loops, best of 3: 7.54 µs per loop
In [39]: b = list(range(1966,2100))
In [40]: samp = list(range(1966,2100))
In [41]: a = [choice(samp) for _ in range(100)]
In [42]: timeit [b.index(x) for x in a
10000 loops, best of 3: 154 µs per loop
In [43]: %%timeit
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
....:
10000 loops, best of 3: 22.5 µs per loop
这是 Padraic Cunningham 的扩展。相反,如果将要索引的列表转换为字典,则可以实现 O(1) 查找,用于 O(n) 预处理:
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
d = {value: index for index, value in enumerate(b)}
print([d[x] for x in a])
>>> timeit("[bisect_left(b, x) for x in a]", "from __main__ import a, b; from bisect import bisect_left")
3.513558427607279
>>> timeit("[b.index(x) for x in a]", "from __main__ import a, b")
8.010070997323822
>>> timeit("d = {value: index for index, value in enumerate(b)}; [d[x] for x in a]", "from __main__ import a, b")
5.5277420695707065
>>> timeit("[d[x] for x in a]", "from __main__ import a, b, ;d = {value : index for index, value in enumerate(b)}")
1.1214096146165389
所以,如果你不考虑预处理,你在实际处理中比使用 b.index
快将近 8 倍 - 如果你正在做很多列表 a
,这会更好反对更少的 b
。如果只使用一次 bisect_left
会更快,并且可以保证 b
是单调递增的。
我有两个列表,a
和 b
。 a
包含我想知道 b
中匹配元素索引的元素。在 b
中,每个元素都是唯一的,这与 a
中不同。
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
使用来自Finding the indices of matching elements in list in Python的解决方案:
matching = [match for match, element in enumerate(b) if element in a]
matching
然而只是 [27, 28, 29, 30, 32, 37, 39]
,但我希望它是 [27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
.
>>> a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
>>> b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
>>> [b.index(x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
那
呢print [b.index(i) for i in a if i in b]
如果你的列表很大,将 b 设为一个集合会更有效率:
st = set(b)
print([b.index(x) for x in a if x in st])
由于您的数据已排序并假设来自 a 的所有元素都在 b 中,您还可以使用 bisect 因此每个索引查找都是 O(log n):
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
from bisect import bisect_left
print [bisect_left(b, x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
在小型数据集上,它 运行 的速度是索引的两倍:
In [22]: timeit [bisect_left(b, x) for x in a]
100000 loops, best of 3: 4.2 µs per loop
In [23]: timeit [b.index(x) for x in a]
100000 loops, best of 3: 8.84 µs per loop
另一种选择是使用字典来存储索引,这意味着代码将 运行 在线性时间内,一次通过 a 一次通过 b:
# store all indexes as values and years as keys
indexes = {k: i for i, k in enumerate(b)}
# one pass over a accessing each index in constant time
print [indexes[x] for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
即使在小输入集上也比索引更有效,并且随着 a 的增长会更有效:
In [34]: %%timeit
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
....:
100000 loops, best of 3: 7.54 µs per loop
In [39]: b = list(range(1966,2100))
In [40]: samp = list(range(1966,2100))
In [41]: a = [choice(samp) for _ in range(100)]
In [42]: timeit [b.index(x) for x in a
10000 loops, best of 3: 154 µs per loop
In [43]: %%timeit
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
....:
10000 loops, best of 3: 22.5 µs per loop
这是 Padraic Cunningham
a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
d = {value: index for index, value in enumerate(b)}
print([d[x] for x in a])
>>> timeit("[bisect_left(b, x) for x in a]", "from __main__ import a, b; from bisect import bisect_left")
3.513558427607279
>>> timeit("[b.index(x) for x in a]", "from __main__ import a, b")
8.010070997323822
>>> timeit("d = {value: index for index, value in enumerate(b)}; [d[x] for x in a]", "from __main__ import a, b")
5.5277420695707065
>>> timeit("[d[x] for x in a]", "from __main__ import a, b, ;d = {value : index for index, value in enumerate(b)}")
1.1214096146165389
所以,如果你不考虑预处理,你在实际处理中比使用 b.index
快将近 8 倍 - 如果你正在做很多列表 a
,这会更好反对更少的 b
。如果只使用一次 bisect_left
会更快,并且可以保证 b
是单调递增的。