Comparing two Python lists to find differences while ignoring spelling mistakes

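I have two Excel worksheets, and I am building a small program that compares two columns from these worksheets to find the differences. The problem is that most of the entries were typed in by hand, so there are a lot of spelling mistakes, and those should be ignored. The program should highlight data that is new or has been deleted.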

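I have been reading about fuzzy text matching and found this code online (link), but all it does is produce a CSV with exactly the same entries (not what I want). I'm adding it here anyway so you can see what I'm talking about.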

import pandas as pd
from two_lists_similarity import Calculate_Similarity as cs

# The first (old) file
book_old = pd.read_excel(r' #Input file here', sheet_name='#Sheet Name Here')
# Select the column I want to compare: column position 2, from row position 7 onward
data_old = book_old.iloc[7:, 2].tolist()

# The second (new) file to compare against
book_new = pd.read_excel(r'#source here', sheet_name='#Sheet name')
data_new = book_new.iloc[7:, 2].tolist()  # same column selection

inp_list = data_old
ref_list = data_new

# This is what I picked up online because I couldn't do it myself.
# The plan is to iterate over the lists and find entries that differ, ignoring spelling.
# Create an instance (object) of the Calculate_Similarity class.
csObj = cs(inp_list, ref_list)
# Write the fuzzy-match results to a CSV file.
csObj.fuzzy_match_output(output_csv_name='pkg_sim_test_vsc.csv', output_csv_path=r'#Output path')

You probably need some function that computes how different two strings are.

It turns out there are already algorithms that do this! Check out the Damerau–Levenshtein distance, which seems to be the closest fit for your use case. From Wikipedia:

Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
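To make the definition concrete, here is a minimal sketch of the restricted variant of this distance (often called optimal string alignment), which counts insertions, deletions, substitutions and adjacent transpositions but never edits the same substring twice. The function name `osa_distance` is just an illustrative choice, not something from a particular library.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = number of edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped counts as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("recieve", "receive"))  # 1: a single transposition
```

A small distance relative to the length of the strings suggests a typo rather than a genuinely different entry.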

However, this will only catch simple typos and is prone to false positives, so you may want to combine it with some other mechanism.
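As a rough sketch of how such a check could be turned into the list comparison from the question: treat two entries as "the same" when their similarity is above a cutoff, and report only entries with no close counterpart in the other list. Note that this example swaps in the standard library's `difflib.SequenceMatcher` ratio instead of an edit distance, so that it runs without extra packages. The names `similarity` and `fuzzy_diff` and the cutoff of 0.85 are assumptions for illustration and would need tuning on real data. It also assumes every entry is a string (empty Excel cells read by pandas would need to be dropped or converted first).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_diff(old, new, cutoff=0.85):
    """Return (removed, added): entries with no close match in the other list."""
    removed = [o for o in old
               if not any(similarity(o, n) >= cutoff for n in new)]
    added = [n for n in new
             if not any(similarity(n, o) >= cutoff for o in old)]
    return removed, added


old = ["Apple Juice", "Orange Soda", "Mineral Water"]
new = ["Aple Juice", "Orange Soda", "Green Tea"]
removed, added = fuzzy_diff(old, new)
print("removed:", removed)
print("added:", added)
```

This compares every old entry against every new entry, which is fine for a few thousand rows but grows quadratically. A lower cutoff hides more typos at the cost of hiding more real changes.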

There are Python implementations of this algorithm around the web (see here or there).

Alternatively, feel free to look at some other algorithms, such as: