Correcting mistakes in strings based on array of similar strings

Question

Can someone point me in the right direction for this task? I have many arrays of strings, each forming a bucket like this:

Somebody did this two years ago
Somebody did it two years ago
Somebody did this two years ago
Somedody d that two years ago
Somebody two years ago

I need to get this: Somebody did it two years ago

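Any links to algorithms or libraries would be great. The strings come from OCR, which sometimes gets letters or whole words wrong, but I have 2-5 different spellings of the same string.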

Update: following @alec_djinn's suggestion, I found a Python library that can build a "median" string based on Levenshtein distance: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median
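To make the "median" idea concrete, here is a minimal pure-Python sketch of a set median: the string from the bucket whose total edit distance to all the others is smallest. This is not the linked library's implementation. The helper names `levenshtein` and `set_median` and the sample bucket are made up for illustration, and a true generalized median may be a string that does not appear in the bucket at all.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def set_median(strings):
    """Return the input string with the smallest total distance to all inputs."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


# Hypothetical OCR variants of one line, loosely modelled on the question.
variants = [
    "Somebody did this two years ago",
    "Somebody did tis two years ago",
    "Somebody did this two years ago",
    "Somedody d this 2 years ago",
    "Somebody this two years ago",
]
print(set_median(variants))  # prints: Somebody did this two years ago
```

A set median can only return a spelling that already occurs in the bucket, so it helps when at least one variant is error-free. When every variant contains a different mistake, a generalized median or an alignment-based consensus is the better fit.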

Answer 1

You could use a sequence alignment algorithm and then find the consensus of the aligned sequences.

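There are plenty of libraries and tools available, but they usually work only on biological sequences (DNA, RNA, proteins). One Python library for aligning general strings is https://pypi.python.org/pypi/alignment/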

Once the sequences are aligned, you can compute the consensus with the following (very basic) method.

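from collections import Counter

def compute_consensus(sequences):
    # sequences: aligned strings of equal length, with '-' marking gaps
    consensus = ''
    for i in range(len(sequences[0])):
        # count the characters that appear in column i
        char_count = Counter()
        for seq in sequences:
            char_count.update(seq[i])
        # keep the most frequent character in this column
        consensus += char_count.most_common()[0][0]

    return consensus.replace('-', '')  # assuming '-' represents deleted letters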

where sequences is the list of aligned sequences. All aligned sequences should have the same length.
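As a quick illustration of the column-wise vote, here is a self-contained sketch run on a few hand-aligned strings. The inputs are hypothetical and were aligned by hand, with '-' as the gap character; in practice they would come out of whatever alignment step you use. The `gap` parameter and the length check are additions for this example only.

```python
from collections import Counter


def compute_consensus(sequences, gap="-"):
    """Column-wise majority vote over equal-length, already aligned strings."""
    if len(set(map(len, sequences))) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*sequences) yields one tuple of characters per alignment column
    winners = (Counter(column).most_common(1)[0][0] for column in zip(*sequences))
    # drop columns where the majority vote was a gap
    return "".join(winners).replace(gap, "")


# Hypothetical, hand-aligned OCR variants (all 17 characters long).
aligned = [
    "Somebody did this",
    "Somebody did t-is",
    "Somedody d-- this",
    "Somebody did thls",
]
print(compute_consensus(aligned))  # prints: Somebody did this
```

Each OCR error here is outvoted in its own column, which is why the method works best with three or more variants per bucket. With only two variants a disagreement becomes a tie, and the winner is then simply whichever character was seen first.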


Tags: grouping, text, fuzzy