Comparing two Python lists to find differences and ignoring spelling mistakes

Question

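I have two Excel sheets, and I am building a small program that compares two columns in these sheets to find the differences. The problem is that most of the entries were typed in by hand, so there are a lot of spelling mistakes, and these should be ignored. The program should highlight data that is new or has been deleted.

I was reading about fuzzy text matching and found this code online (link), but its output is just a CSV containing the entries that match exactly (not what I want). I am adding it here anyway so you can see what I am talking about.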

from __future__ import division
import numpy as np
import pandas as pd
from collections import Counter
import collections
from fuzzywuzzy import fuzz
import time
from two_lists_similarity import Calculate_Similarity as cs

# The first file
book_old = pd.read_excel(r' #Input file here', sheet_name = '#Sheet Name Here')
# Select the column I want to compare (third column, from the eighth row on)
data_old = book_old.iloc[7:,2].tolist()

# Second file to compare with
book_new = pd.read_excel(r'#source here', sheet_name = '#Sheet name')
data_new = book_new.iloc[7:,2].tolist()  # select the same column

inp_list = data_old
ref_list = data_new
# This is what I picked up online because I couldn't do it myself.
# The plan is to iterate over the lists and find entries that differ, ignoring spelling.
# Create an instance of the class (an object).
csObj = cs(inp_list,ref_list)
# csObj is now an object of the Calculate_Similarity class.
csObj.fuzzy_match_output(output_csv_name = 'pkg_sim_test_vsc.csv', output_csv_path = r'#Output path')

Answer 1

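You probably need a function that measures the difference between two strings.

It turns out there are already algorithms that do this! Take a look at the Damerau–Levenshtein distance, which seems closest to your use case. From Wikipedia: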

Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
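
As a rough illustration of that definition, here is a minimal sketch in plain Python. Note that it implements the restricted variant (often called optimal string alignment), which never edits the same substring twice, so it can differ slightly from the unrestricted Damerau–Levenshtein distance. The function name `osa_distance` is my own choice, not taken from any library.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters needed to turn `a` into `b`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("recieve", "receive"))  # 1 (one transposition)
print(osa_distance("kitten", "sitting"))   # 3
```

A small distance relative to the string length suggests a typo rather than a genuinely different entry.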

However, this only catches simple typos and is prone to false positives, so you may want to combine it with some other mechanism.

There are Python implementations of this algorithm on the web (see here or there).
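
To connect this back to the two columns in the question, here is a minimal sketch of the list comparison itself. It uses the `ratio()` of the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure rather than Damerau–Levenshtein. The function names, the sample data, and the 0.8 threshold are all assumptions made for illustration; the threshold in particular would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    a, b = str(a).strip().lower(), str(b).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def diff_ignoring_typos(old, new, threshold=0.8):
    """Return (added, removed) entries, treating near matches as equal.

    An entry counts as unchanged if any entry in the other list is at
    least `threshold` similar to it (hypothetical cut-off, tune as needed).
    """
    removed = [o for o in old
               if not any(similarity(o, n) >= threshold for n in new)]
    added = [n for n in new
             if not any(similarity(n, o) >= threshold for o in old)]
    return added, removed


# Made-up sample data standing in for the two spreadsheet columns.
old_items = ["Steel bolt M8", "Copper wire", "Rubber gasket"]
new_items = ["Steel blot M8", "Copper wire", "Plastic cap"]

added, removed = diff_ignoring_typos(old_items, new_items)
print(added)    # ['Plastic cap']
print(removed)  # ['Rubber gasket']
```

This compares every entry against every other entry, so it is quadratic in the list lengths; that is usually fine for a spreadsheet column but may be slow for very large lists.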

Alternatively, feel free to look at some other algorithms, such as:
