Python: 识别并删除列表中的重复序列
Python: Identifying and deleting duplicate sequences in list
我正在寻找在列表中查找重复序列的最佳方法。
序列被定义为至少两个相邻值。
示例:在下面的列表中,应识别并删除重复序列。
a = [45874,
35195, # <-
28965,
05867,
25847, # <-
94937,
64894,
55535,
62899,
00391,
35195, # Duplicate Sequence Start
28965,
05867,
25847, # Duplicate Sequence End
08483,
55801,
33129,
42616]
我想不出任何解决方案,非常感谢您的帮助!
我用下面的方法解决了,可能不是很有效:
sequences=[] #all the sequences, as defined in question
duplicated=[] #all the duplicated sequences
for i in range(2, len(a)): #possible lengths of sequence
n=0 # index of start of sequence
for j in a:
if n+i<=len(a): #if it's not too close to the end
sequences.append(a[n:n+i]) #add the sequence
n+=1
tests=sequences[:] #duplicate so that I can remove for checking purposes
for i in sequences:
tests.remove(i)
if i in tests: #if it's in there twice
duplicated.append(i) #add it to the duplicates
for i in duplicated:
found=False #haven't found it yet
n=0 #index we're at
for j in a:
if a[n:n+len(i)]==i and not found: #if it matches
del a[n:n+len(i)] #remove it
found=True #don't remove it again
n+=1
在长度为 n 的字符串中找到长度为 m 的单个子序列至少可以在 内完成O(nm) 与 Boyer-Moore algorithm。找到所有重复的子序列很可能不止于此。不过,如果你只想删除子序列而不是找到它们,那么有一个技巧。
删除 O(n)
中的重复子序列
我们只需要关注长度为 2 的子序列,因为任何序列都可以表示为长度为 2 的重叠子序列。这一观察使我们能够对此类对进行计数,然后从右到左删除它们。
特别是,这只需要遍历列表固定次数,因此 O(n)。
from collections import Counter
def remove_duplicates(lst):
dropped_indices = set()
counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))
for i in range(len(lst) - 2, -1, -1):
sub = tuple(lst[i:i+2])
if counter[sub] > 1:
dropped_indices |= {i, i + 1}
counter[sub] -= 1
return [x for i, x in enumerate(lst) if i not in dropped_indices]
这是一个基于提供的输入的示例。
a = [45874,
35195, # This
28965, # Is
5867, # A
25847, # Subsequence
94937,
64894,
55535,
62899,
391,
35195, # Those
28965, # Should
5867, # Be
25847, # Removed
8483]
b = remove_duplicates(a)
# Output:
# [45874,
# 35195,
# 28965,
# 5867,
# 25847,
# 94937,
# 64894,
# 55535,
# 62899,
# 391,
# <- Here the duplicate was removed
# 8483]
我正在寻找在列表中查找重复序列的最佳方法。 序列被定义为至少两个相邻值。
示例:在下面的列表中,应识别并删除重复序列。
a = [45874,
35195, # <-
28965,
05867,
25847, # <-
94937,
64894,
55535,
62899,
00391,
35195, # Duplicate Sequence Start
28965,
05867,
25847, # Duplicate Sequence End
08483,
55801,
33129,
42616]
我想不出任何解决方案,非常感谢您的帮助!
我用下面的方法解决了,可能不是很有效:
sequences=[] #all the sequences, as defined in question
duplicated=[] #all the duplicated sequences
for i in range(2, len(a)): #possible lengths of sequence
n=0 # index of start of sequence
for j in a:
if n+i<=len(a): #if it's not too close to the end
sequences.append(a[n:n+i]) #add the sequence
n+=1
tests=sequences[:] #duplicate so that I can remove for checking purposes
for i in sequences:
tests.remove(i)
if i in tests: #if it's in there twice
duplicated.append(i) #add it to the duplicates
for i in duplicated:
found=False #haven't found it yet
n=0 #index we're at
for j in a:
if a[n:n+len(i)]==i and not found: #if it matches
del a[n:n+len(i)] #remove it
found=True #don't remove it again
n+=1
在长度为 n 的字符串中找到长度为 m 的单个子序列至少可以在 内完成O(nm) 与 Boyer-Moore algorithm。找到所有重复的子序列很可能不止于此。不过,如果你只想删除子序列而不是找到它们,那么有一个技巧。
删除 O(n)
中的重复子序列我们只需要关注长度为 2 的子序列,因为任何序列都可以表示为长度为 2 的重叠子序列。这一观察使我们能够对此类对进行计数,然后从右到左删除它们。
特别是,这只需要遍历列表固定次数,因此 O(n)。
from collections import Counter
def remove_duplicates(lst):
dropped_indices = set()
counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))
for i in range(len(lst) - 2, -1, -1):
sub = tuple(lst[i:i+2])
if counter[sub] > 1:
dropped_indices |= {i, i + 1}
counter[sub] -= 1
return [x for i, x in enumerate(lst) if i not in dropped_indices]
这是一个基于提供的输入的示例。
a = [45874,
35195, # This
28965, # Is
5867, # A
25847, # Subsequence
94937,
64894,
55535,
62899,
391,
35195, # Those
28965, # Should
5867, # Be
25847, # Removed
8483]
b = remove_duplicates(a)
# Output:
# [45874,
# 35195,
# 28965,
# 5867,
# 25847,
# 94937,
# 64894,
# 55535,
# 62899,
# 391,
# <- Here the duplicate was removed
# 8483]