How do you remove duplicates from a list in Python whilst preserving order and length?
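What I want to do is remove duplicates from a list and insert an empty item each time a duplicate is removed.
I have code that removes the duplicates. It also leaves empty list items alone: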
import csv
# Create (or truncate) the output file
new_file = open('addr_list_corrected.csv', 'w')
new_file.close()
with open('addr_list.csv', 'r') as addr_list:
    csv_reader = csv.reader(addr_list, delimiter=',')
    for row in csv_reader:
        print(row)
        print("##########################")
        seen = set()
        seen_add = seen.add
        # an empty cell/element evaluates to false, so it is never treated as a duplicate
        new_row = [cell for cell in row if not (cell and cell in seen or seen_add(cell))]
        print(new_row)
        with open('addr_list_corrected.csv', 'a') as addr_list_corrected:
            csv_writer = csv.writer(addr_list_corrected, delimiter=',')
            csv_writer.writerow(new_row)
But I need to replace each removed item with an empty string.
I would do that with an iterator. Something like this:
def dedup(seq):
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)
Edit: reversed the logic to make the meaning clearer.
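As a minimal sketch of how this generator might slot into the CSV loop from the question, the snippet below reads rows from an in-memory buffer instead of `addr_list.csv`. The sample data and the `io.StringIO` buffers are assumptions made only to keep the example self-contained, and it targets Python 3:

```python
import csv
import io


def dedup(seq):
    """Yield each item of seq, replacing repeats with an empty string."""
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)


# Hypothetical sample data standing in for addr_list.csv.
source = io.StringIO("a,b,a,c\nx,x,,y\n")
target = io.StringIO()

writer = csv.writer(target, delimiter=',', lineterminator='\n')
for row in csv.reader(source, delimiter=','):
    # list() consumes the generator; the new row keeps the original length.
    writer.writerow(list(dedup(row)))

corrected = target.getvalue()
print(corrected)
```

Because the `seen` set lives inside each call to `dedup`, duplicates are tracked per row, as in the question's code.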
Another option is to do something like this:
seen = dict()
seen_setdefault = seen.setdefault
new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
For example:
>>> row = ["to", "be", "or", "not", "to", "be"]
>>> seen = dict()
>>> seen_setdefault = seen.setdefault
>>> new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
>>> new_row
['to', 'be', 'or', 'not', '', '']
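The comprehension works because `dict.setdefault(key, default)` inserts the key when it is missing and returns the value stored for it, so a first occurrence passes through unchanged. A small sketch of that behaviour (the sample keys are arbitrary):

```python
seen = {}

# First call: "to" is missing, so it is inserted and the default is returned.
first = seen.setdefault("to", "to")

# Later call: the key already exists, so the stored value is returned
# and the new default is ignored.
second = seen.setdefault("to", "something else")

print(first, second, seen)
```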
Edit 2: Out of curiosity, I ran a quick test to see which approach was fastest:
>>> from random import randint
>>> from statistics import mean
>>> from timeit import repeat
>>>
>>> def standard(seq):
...     """Trivial modification to standard method for removing duplicates."""
...     seen = set()
...     seen_add = seen.add
...     return ["" if x in seen or seen_add(x) else x for x in seq]
...
>>> def dedup(seq):
...     seen = set()
...     for v in seq:
...         yield '' if v in seen else v
...         seen.add(v)
...
>>> def pedro(seq):
...     """Pedro's iterator based approach to removing duplicates."""
...     my_dedup = dedup
...     return [x for x in my_dedup(seq)]
...
>>> def srgerg(seq):
...     """Srgerg's dict based approach to removing duplicates."""
...     seen = dict()
...     seen_setdefault = seen.setdefault
...     return ["" if cell in seen else seen_setdefault(cell, cell) for cell in seq]
...
>>> data = [randint(0, 10000) for x in range(100000)]
>>>
>>> mean(repeat("standard(data)", "from __main__ import data, standard", number=100))
1.2130275770426708
>>> mean(repeat("pedro(data)", "from __main__ import data, pedro", number=100))
3.1519048346103555
>>> mean(repeat("srgerg(data)", "from __main__ import data, srgerg", number=100))
1.2611971098676882
As the results show, a relatively simple modification of the standard method described in this other Stack Overflow question is the fastest.
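Timings like these depend on the machine, the Python version and the data, so it may be worth re-running the comparison locally. Below is a rough script version of the session above. The smaller data size, the `number=10` and `repeat=5` settings, and the use of `min` rather than `mean` are my own choices for a quicker run, not part of the original test. It also checks that the three functions agree before timing them.

```python
from random import randint
from timeit import repeat


def standard(seq):
    seen = set()
    seen_add = seen.add
    return ["" if x in seen or seen_add(x) else x for x in seq]


def dedup(seq):
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)


def pedro(seq):
    return [x for x in dedup(seq)]


def srgerg(seq):
    seen = dict()
    seen_setdefault = seen.setdefault
    return ["" if cell in seen else seen_setdefault(cell, cell) for cell in seq]


# Smaller than the original test data so the script finishes quickly.
data = [randint(0, 1000) for _ in range(10000)]

# Sanity check: all three approaches should produce the same list.
results = [func(data) for func in (standard, pedro, srgerg)]
assert results[0] == results[1] == results[2]

timings = {}
for func in (standard, pedro, srgerg):
    # repeat() accepts a callable and returns one total time per repetition.
    timings[func.__name__] = min(repeat(lambda: func(data), number=10, repeat=5))

for name, seconds in timings.items():
    print(name, seconds)
```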
You can use a set to keep track of the items you have seen. Using the example list from above:
x = ['to', 'be', 'or', 'not', 'to', 'be']
seen = set()
for index, item in enumerate(x):
    if item in seen:
        # repeated item: blank it out in place
        x[index] = ''
    else:
        seen.add(item)
print(x)
You can create a new list, appending each element if it is not already in the new list, and appending None if it is.
oldList = [3, 1, 'a', 2, 4, 2, 'a', 5, 1, 3]
newList = []
for i in oldList:
    if i in newList:
        # already present: keep the slot but fill it with None
        newList.append(None)
    else:
        newList.append(i)
print(newList)
Output:
[3, 1, 'a', 2, 4, None, None, 5, None, None]
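One thing to keep in mind with this approach is that `i in newList` scans the list on every iteration, so the running time grows roughly quadratically with the input length. A possible variant, assuming all items are hashable, tracks seen values in a separate set. The function name `blank_duplicates` and its `placeholder` parameter are hypothetical, introduced only for this sketch:

```python
def blank_duplicates(items, placeholder=None):
    """Return a new list of the same length with repeats replaced by placeholder."""
    seen = set()
    result = []
    for item in items:
        if item in seen:
            # Repeat: keep the position but store the placeholder.
            result.append(placeholder)
        else:
            seen.add(item)
            result.append(item)
    return result


old_list = [3, 1, 'a', 2, 4, 2, 'a', 5, 1, 3]
new_list = blank_duplicates(old_list)
print(new_list)
```

Passing `placeholder=''` would give the empty-string behaviour asked for in the question.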