从列表列表中有效地删除与顺序无关的重复项
Efficiently remove duplicates, order-agnostic, from list of lists
以下列表有一些重复的子列表,元素的顺序不同:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
['there', 'hi'],
['jumps', 'dog', 'over','lazy', 'the'],
]
如何删除重复项,保留看到的第一个实例,以获得:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
]
我试过:
[list(i) for i in set(map(tuple, l1))]
尽管如此,我不知道这是否是处理大型列表的最快方法,而且我的尝试没有达到预期效果。知道如何有效地移除它们吗?
这个有点棘手。你想从冻结的计数器中键入一个命令,但计数器在 Python 中不可散列。对于渐近复杂性的小幅下降,您可以使用排序的元组代替冻结计数器:
seen = set()
result = []
for x in l1:
key = tuple(sorted(x))
if key not in seen:
result.append(x)
seen.add(key)
同样的想法在一行中看起来像这样:
[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
@wim 的回答效率低下,因为它将列表项排序作为唯一标识一组列表项计数的方式,这会花费 O(n log n) 时间每个子列表的复杂性。
要在线性时间复杂度内实现相同的目标,您可以使用 collections.Counter
class 的冻结项目计数集。由于 dict comprehension 保留了带有重复键的项目的最后一个值,并且由于您想在问题中保留带有重复键的项目的第一个值,因此您必须按照列表的相反顺序构造 dict,并在之后再次反转它已构建去重子列表列表:
from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
这个returns:
[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
这个:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]
l2 给出了删除反向重复项的列表。
比较:
我做了一个快速基准测试,比较了各种答案:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
from collections import Counter
def method1():
"""manually construct set, keyed on sorted tuple"""
seen = set()
result = []
for x in l1:
key = tuple(sorted(x))
if key not in seen:
result.append(x)
seen.add(key)
return result
def method2():
"""frozenset-of-Counter"""
return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())
def method3():
"""wim"""
return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
from timeit import timeit
print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))
打印:
0.0025010189856402576
0.016385524009820074
0.0026451340527273715
以下列表有一些重复的子列表,元素的顺序不同:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
['there', 'hi'],
['jumps', 'dog', 'over','lazy', 'the'],
]
如何删除重复项,保留看到的第一个实例,以获得:
l1 = [
['The', 'quick', 'brown', 'fox'],
['hi', 'there'],
['jumps', 'over', 'the', 'lazy', 'dog'],
]
我试过:
[list(i) for i in set(map(tuple, l1))]
尽管如此,我不知道这是否是处理大型列表的最快方法,而且我的尝试没有达到预期效果。知道如何有效地移除它们吗?
这个有点棘手。你想从冻结的计数器中键入一个命令,但计数器在 Python 中不可散列。对于渐近复杂性的小幅下降,您可以使用排序的元组代替冻结计数器:
seen = set()
result = []
for x in l1:
key = tuple(sorted(x))
if key not in seen:
result.append(x)
seen.add(key)
同样的想法在一行中看起来像这样:
[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
@wim 的回答效率低下,因为它将列表项排序作为唯一标识一组列表项计数的方式,这会花费 O(n log n) 时间每个子列表的复杂性。
要在线性时间复杂度内实现相同的目标,您可以使用 collections.Counter
class 的冻结项目计数集。由于 dict comprehension 保留了带有重复键的项目的最后一个值,并且由于您想在问题中保留带有重复键的项目的第一个值,因此您必须按照列表的相反顺序构造 dict,并在之后再次反转它已构建去重子列表列表:
from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
这个returns:
[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
这个:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]
l2 给出了删除反向重复项的列表。
比较:
我做了一个快速基准测试,比较了各种答案:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
from collections import Counter
def method1():
"""manually construct set, keyed on sorted tuple"""
seen = set()
result = []
for x in l1:
key = tuple(sorted(x))
if key not in seen:
result.append(x)
seen.add(key)
return result
def method2():
"""frozenset-of-Counter"""
return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())
def method3():
"""wim"""
return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
from timeit import timeit
print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))
打印:
0.0025010189856402576
0.016385524009820074
0.0026451340527273715