Python: How to remove duplicate/similar lines

Question

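I have a file containing many messages. Each line is a distinct message; the messages share a similar structure but differ slightly. For example: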

Error number 609 at line 10
Error number 609 at line 22
Error string "foo" at line 11
Error string "bar" at line 14

and I would like the output to be something like:

Error number 609 at line 10
Error string "foo" at line 11

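since the dropped lines are the "same" type of error as the ones kept.

I managed to remove similar lines, but the problem is that I have to iterate over every line in the file many times until there are no more "duplicates".

What I currently have: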

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

lst = open("result.txt").readlines()
print(len(lst))
for i in lst:
    for index, line in enumerate(lst):
        try:
            if similar(lst[index],lst[index + 1]) > 0.8:
                lst.pop(index)
        except:
            pass

print(len(lst))

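But this is not a reliable approach: it may loop far more often than needed, or, if the file is very large and contains many "same" lines, the number of passes may still not be enough.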

Edit:

A more accurate example of one of the many message types in the file:

[{TYPE}] Timeout after {miliseconds} millis, source ref: {random-number}, system: {system}, delivered {system}: , current {system}: {time}

Answer 1

You just need to open the log file and read it line by line.

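from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

kept = []
prev = None  # last line that was kept
with open('result.txt') as infile:
    for line in infile:
        # keep a line only if it differs enough from the previously kept one
        if prev is None or similar(prev, line) <= 0.8:
            kept.append(line)
            prev = line

print(len(kept))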
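
If similar messages are not adjacent in the file, comparing each line only with its predecessor will miss them. A minimal sketch of one way around that, assuming the 0.8 threshold from the question is acceptable: compare each line against the lines kept so far, so the input is traversed only once. The name `dedupe_similar` is hypothetical, not part of any library.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Similarity ratio in [0, 1], same helper as in the question."""
    return SequenceMatcher(None, a, b).ratio()


def dedupe_similar(lines, threshold=0.8):
    """Keep the first line of each group of similar lines.

    Hypothetical helper: each line is compared against the lines kept
    so far, so the input itself is read only once.
    """
    kept = []
    for line in lines:
        # Drop the line if it resembles any representative already kept.
        if not any(similar(line, rep) > threshold for rep in kept):
            kept.append(line)
    return kept


sample = [
    'Error number 609 at line 10',
    'Error number 609 at line 22',
    'Error string "foo" at line 11',
    'Error string "bar" at line 14',
]
result = dedupe_similar(sample)
print(result)
```

The cost grows with the number of distinct message types, since every new line is checked against every kept representative. That should be fine when there are few types, but it can get slow on a log with thousands of them.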

Answer 2

Assuming every entry in the input file has the following format...

[{TYPE}] Timeout after {miliseconds} millis, source ref: {random-number}...

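with open("result.txt") as infile:
    lst = infile.readlines()

pretoken = "["
posttoken = "]"

foundTypes = set()  # message types already seen
log = []            # first line seen for each type

for line in lst:
    # Collect the characters before the first "]" (skipping "[") as the type.
    foundType = ""
    for letter in line:
        if letter == pretoken: continue
        elif letter == posttoken: break
        else: foundType += letter

    if foundType not in foundTypes:
        foundTypes.add(foundType)
        log.append(line)

print(log)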
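
The lines in the first example of the question have no [{TYPE}] prefix, so keying on the bracketed type would not separate them. A rough sketch of a related idea, under the assumption that only quoted strings and runs of digits vary between messages of the same type: replace those parts with placeholders and use the result as the key. The names `template_key` and `dedupe_by_template` and the placeholder strings are made up for illustration.

```python
import re


def template_key(line):
    """Replace the variable parts of a message with placeholders.

    Assumption: only quoted strings and digit runs differ between
    messages of the same type; adjust the patterns for other formats.
    """
    key = re.sub(r'"[^"]*"', '<str>', line)  # quoted strings first
    key = re.sub(r'\d+', '<num>', key)       # then runs of digits
    return key.strip()


def dedupe_by_template(lines):
    """Keep the first line seen for each template key."""
    seen = set()
    unique = []
    for line in lines:
        key = template_key(line)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


sample = [
    'Error number 609 at line 10',
    'Error number 609 at line 22',
    'Error string "foo" at line 11',
    'Error string "bar" at line 14',
]
unique = dedupe_by_template(sample)
print(unique)
```

Because each line is reduced to a key and looked up in a set, this needs one pass and no pairwise comparisons. The trade-off is that the patterns have to match how the messages actually vary.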

Tags: python, sorting, algorithm, similarity, duplicates