Python: optimizing how to find duplicate values and their indexes in a list
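I have a list of 18 000 unique IDs.
Each ID is a concatenation of the letters A, B, C, D.
I wrote code that groups the IDs by ID[0:-1] and gives the index positions of the duplicated IDs.
It works, but it takes a long time: about 110 secs for 18 000 IDs.
Do you have any ideas for speeding up my code?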
import time

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
startTime = time.time()
# Unique prefixes (each ID without its last character)
b = [i[0:-1] for i in a]
b = list(set(b))
result = range(len(b))
for i in result:
    result[i] = [b[i], []]
    # Full scan of a for every unique prefix: this nested loop is the slow part
    for j in xrange(len(a)):
        if b[i] == a[j][0:-1]:
            result[i][1].append(j)
endTime = time.time()
print endTime - startTime, 'secs !'
Output:
>>> [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
This is exactly the kind of task that groupby in Python handles efficiently:
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
Output:
[['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
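The sorted call in the snippet above matters because groupby only merges consecutive items that share a key. A minimal sketch of the difference, written for Python 3 (unlike the Python 2 snippets in this thread) and using a named key function in place of the lambda:

```python
from itertools import groupby

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']


def key(i):
    # Group key: the ID at index i without its last character.
    return a[i][:-1]


# Without sorting, indexes 3 and 5 share a prefix but are not adjacent,
# so groupby reports them as two separate groups.
unsorted_groups = [[k, list(g)] for k, g in groupby(range(len(a)), key=key)]

# Sorting the indexes by the same key first makes equal prefixes adjacent.
sorted_groups = [[k, list(g)] for k, g in groupby(sorted(range(len(a)), key=key), key=key)]

print(unsorted_groups)
print(sorted_groups)
```

In the unsorted case the prefix '1BCABCCCA' shows up twice, once with [3] and once with [5], which is why the sort step cannot be dropped from this approach.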
As a more Pythonic way to solve this kind of problem, use collections.defaultdict:
>>> from collections import defaultdict
>>> new=[i[:-1] for i in a]
>>> d=defaultdict(list)
>>> for i,j in enumerate(new):
...     d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]})
>>> d.items()
[('1CDABCABD', [0, 1, 2]), ('1DDAABBBB', [4]), ('1BCABCCCA', [3, 5])]
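Note that the defaultdict approach is a linear-time solution, so it is more efficient than itertools.groupby combined with sorted, which has to pay for an O(n log n) sort first.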
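For reuse, the same single-pass idea can be wrapped in a function. This is only a sketch in Python 3 syntax; the names group_indexes and duplicates are made up for illustration and do not come from the thread. It also filters out prefixes that occur only once, assuming "duplicate" means a prefix shared by at least two IDs:

```python
from collections import defaultdict


def group_indexes(ids):
    """Map each prefix (ID without its last character) to the indexes where it occurs."""
    groups = defaultdict(list)
    for index, item in enumerate(ids):
        # One dictionary lookup per ID, so the whole pass is linear in len(ids).
        groups[item[:-1]].append(index)
    return dict(groups)


a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
groups = group_indexes(a)

# Keep only prefixes shared by more than one ID.
duplicates = {prefix: idx for prefix, idx in groups.items() if len(idx) > 1}

print(groups)
print(duplicates)
```

Each ID is visited exactly once, whereas the code in the question rescans the whole list for every unique prefix.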
You can also just use the dict.setdefault method:
>>> d={}
>>> for i,j in enumerate(new):
...     d.setdefault(j,[]).append(i)
...
>>> d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
For more detail, see the following benchmark, in which it is ~4X faster:
from timeit import timeit

s1="""
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""
s2="""
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new=[i[:-1] for i in a]
d={}
for i,j in enumerate(new):
    d.setdefault(j,[]).append(i)
d.items()
"""
print ' first: ' ,timeit(stmt=s1, number=100000)
print 'second : ',timeit(stmt=s2, number=100000)
Result:
first: 0.949549913406
second : 0.250894069672
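The timings above were taken on a six-element list, so they may not reflect the 18 000-ID case from the question. A rough sketch of the same comparison at that size, in Python 3, could look like the following. The IDs are synthetic (a '1' followed by nine random letters from A, B, C, D, which is an assumption about the real data), and the function names are made up for illustration:

```python
import random
from itertools import groupby, product
from timeit import timeit

random.seed(0)  # fixed seed so repeated runs use the same synthetic IDs

# Assumed ID shape: '1' followed by nine letters drawn from A, B, C, D.
all_ids = ['1' + ''.join(p) for p in product('ABCD', repeat=9)]
a = random.sample(all_ids, 18000)  # 18 000 unique IDs


def with_groupby():
    key = lambda i: a[i][:-1]
    indexes = sorted(range(len(a)), key=key)
    return [[k, list(g)] for k, g in groupby(indexes, key=key)]


def with_setdefault():
    d = {}
    for i, item in enumerate(a):
        d.setdefault(item[:-1], []).append(i)
    return d


print('groupby   :', timeit(with_groupby, number=20))
print('setdefault:', timeit(with_setdefault, number=20))
```

No timings are quoted here because they depend on the machine and the Python version; the point is only that both approaches can be measured on input of the relevant size.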
An alternative solution that does not use any other modules:
grouped = {}
for i, j in enumerate(a):
    itm = grouped.get(j[0:-1], [])
    itm.append(i)
    grouped[j[0:-1]] = itm
print [[k, v] for k, v in grouped.items()] # [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
Is this what you are looking for:
>>> d = {}
>>> for ind, elem in enumerate(a):
...     d.setdefault(elem[0:-1], []).append(ind)
...
>>> print d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
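This solution is very similar to Kasra's optimized code, but runs slightly faster. The difference is where the slicing happens, though I am not sure why one performs slightly better than the other: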
from timeit import timeit

s1 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA',
     '1DDAABBBBA', '1BCABCCCAD']
d = {}
for ind, elem in enumerate(a):
    d.setdefault(elem[0:-1], []).append(ind)
"""
s2="""
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new=[i[:-1] for i in a]
d={}
for i,j in enumerate(new):
    d.setdefault(j,[]).append(i)
"""
# Double quotes around the format string so the apostrophe in "Kasra's" does not end it
print "Kasra's time/my time: %s" % (timeit(stmt=s2, number=100000) / timeit(stmt=s1, number=100000))
Kasra's time/my time: 1.24058060531