使用 python 将近似字符串匹配从 excel 复制到另一个 excel 文件

Question

您好，我想问一下如何将一些行从一个 excel 文件复制到另一个 excel 文件。通过使用python模糊匹配方法或ANY其他可行的方法，希望根据名称匹配整行并复制到新的excel文件中。

这里是第一个excel文件的输入数据，共有13行6列如下图：

-----------------------------------------------------|-----|-----|-----|-----|-----|
| name                                               | no1 | no2 | no3 | no4 | no5 |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Left___House___Hostel(w) to NothingNow___(w)  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Laksama to Kota_Dun                           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

通过删除第一行，我想让 python 识别大致相似的行名称并复制整行并粘贴到新的 excel 文件中。通过比较单词的相似度而不是字母表，比如有多少单词是相同的，如果大于或等于一定数量（比如50%），就会通过复制。

例如，通过比较第2行和第3行，from Club___Long to Club___Short___Water与from Club___Long to Short___Water非常相似，from Club___Long to Club___Short___Water有7个词，而from Club___Long to Short___Water有6个词。在 from Club___Long to Club___Short___Water 的 7 个词中，有 6 个词与 from Club___Long to Short___Water 相似。因此，6 / 7 * 100% = 85.71%超过50%，python会认为匹配并复制。

比如第2行到第4行大体相同，那么python会匹配识别到几乎一样，只复制第2行到第4行整行到新的excel 文件，并将其命名为 'new_file_1.xlsx'。想要的输出如下图：

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

第5行和第6行同理，命名为'new_file_2.xlsx'，输出如下图：

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

同理第7行到第9行，命名为'new_file_3.xlsx'，输出结果如下图：

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

同理第10行到第11行，命名为'new_file_4.xlsx'，输出结果如下：

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

关于第12行和第13行，它们都与其他行不同，所以不必复制，留着。

如果有人能帮助我，我将不胜感激，谢谢！

Answer 1

我创建了一个函数来替换重复项。它基于模糊逻辑。考虑到截止值，我只是用所有其他名称中匹配度最高的名称替换每个名称。然后，我创建一个新列来存储这些唯一名称

import difflib
import re

def similarity_replace(series):

    reverse_map = {}
    diz_map = {}
    for i,s in series.iteritems():

        clean_s = re.sub(r'(from)|(to)', '', s.lower())
        clean_s = re.sub(r'[^a-z]', '', clean_s)

        diz_map[s] = clean_s
        reverse_map[clean_s] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6))[0]

    return series.map(diz_map).map(best_match).map(reverse_map)

df = pd.DataFrame({'name':['from Club___Long to Club___Short___Water','from Club___Long to Short___Water',
                           'from Club___Long___Land to Short___Water','from Kinabalu___BB to Penang___AA',
                           'from Kinabalu___SD to Penang___SD','from Hill___Town to Unknown___Island___Ice',
                           'from Hill___Town to Unknown___Island___Ice','from Hill___Town to Unknown___Island___Ice',
                           'from Front___House___AA(N) to Back___Garden(N)','from Front___House___AA___(N) to Back___Garden',
                           'from Left___House___Hostel(w) to NothingNow___(w)','from Laksama to Kota_Dun'],
                  'no1':['adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb']})

df['group_name'] = similarity_replace(df.name)
df

我们可以使用此列对所有相似的实例进行分组

for i,group in df.groupby('group_name'):

    ### do something ###
    print(group[['name','no1']])

使用 python 将近似字符串匹配从 excel 复制到另一个 excel 文件

Copy approximate string matching from excel to another excel file using python

python

fuzzy

matching

pandas