How to iterate through 2 columns and match them one by one
Suppose I have 2 Excel files, each containing a single column that holds a name followed by a date.
Excel 1:
Name
0 Bla bla bla June 04 2018
1 Puppy Dog June 01 2017
2 Donald Duck February 24 2017
3 Bruno Venus April 24 2019
Excel 2:
Name
0 Pluto Feb 09 2019
1 Donald Glover Feb 22 2020
2 Dog Feb 22 2020
3 Bla Bla Feb 22 2020
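I want to compare every cell in column 1 against every cell in column 2 and, for each cell in column 1, find the one with the highest similarity.
The following function returns a similarity ratio between 0 and 1 that indicates how closely two inputs match.
SequenceMatcher code example: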
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

x = "Adam Clausen a Feb 09 2019"
y = "Adam Clausen Feb 08 2019"
print(similar(x, y))
Output: 0.92
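For reference, the difflib documentation describes ratio() as 2.0*M/T, where M is the number of matched characters and T is the total number of characters in both strings. The sketch below recomputes the value for the example above from the matching blocks; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

x = "Adam Clausen a Feb 09 2019"
y = "Adam Clausen Feb 08 2019"

matcher = SequenceMatcher(None, x, y)

# M: total size of all matching blocks (the last block is a zero-length sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both strings
total = len(x) + len(y)

manual_ratio = 2.0 * matched / total
print(matched, total, round(manual_ratio, 2))
```

Because the score depends on the combined length, long shared parts such as the date format can raise the ratio even when the names themselves differ.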
If you know how to load the columns into a DataFrame, this code should do the job:
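from difflib import SequenceMatcher

col_1 = ['potato', 'tomato', 'apple']
col_2 = ['tomatoe', 'potatao', 'appel']

def similar(a, b):
    # Return the ratio first so that max() compares tuples by ratio
    ratio = SequenceMatcher(None, a, b).ratio()
    matches = a, b
    return ratio, matches

# For each item in col_1, print the best (ratio, (a, b)) pair found in col_2
for i in col_1:
    print(max(similar(i, j) for j in col_2))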
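As an alternative sketch (not what the answer above uses), the standard library's difflib.get_close_matches can return the best candidate directly. It is built on SequenceMatcher as well. The `best` dictionary is just an illustrative name, and cutoff=0.0 is an assumption made here so that a match is always returned, as with max() above.

```python
from difflib import get_close_matches

col_1 = ['potato', 'tomato', 'apple']
col_2 = ['tomatoe', 'potatao', 'appel']

best = {}
for name in col_1:
    # n=1 keeps only the top candidate; cutoff=0.0 disables the default 0.6 threshold
    candidates = get_close_matches(name, col_2, n=1, cutoff=0.0)
    best[name] = candidates[0] if candidates else None

print(best)
```

Leaving cutoff at its default of 0.6 would instead drop candidates that are not reasonably similar, which may or may not be what you want.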
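UPDATED/SOLVED
The code below does the following:
- It takes 2 input files and loads them into DataFrames
- It then takes a specific column (in this case both are called Name) and uses it as the matching input
- It takes one name from file 1 and compares it against all the names in file 2
- It then picks the name with the highest match, keeps the corresponding rows, and saves them side by side in the output file
Code: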
import pandas as pd
import numpy as np
from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio between 0 and 1
    return SequenceMatcher(None, a, b).ratio()

# Load both Excel files into DataFrames
df1 = pd.read_excel(r'File1.xlsx')
df2 = pd.read_excel(r'File2.xlsx')

# Compare names as strings
df1['Name'] = df1['Name'].astype(str)
df2['Name'] = df2['Name'].astype(str)

# For each name in file 1, find the row in file 2 with the highest ratio
order = []
for index, row in df1.iterrows():
    maxima = [similar(row['Name'], j) for j in df2['Name']]
    best_ratio = max(maxima)       # score of the best match (not saved to the output)
    best_row = np.argmax(maxima)   # position of the best match in df2
    order.append(best_row)

# Reorder df2 so that row i is the best match for row i of df1;
# reset_index() keeps the original df2 row number in an "index" column
df2 = df2.iloc[order].reset_index()

# Place the matched rows side by side and save them
dfFinal = pd.concat([df1, df2], axis=1)
dfFinal.to_excel("OUTPUT.xlsx")
# Thank you for the help!
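The code above computes best_ratio but does not write it anywhere. To try the matching step without Excel files, here is a minimal sketch using plain lists and the sample rows from the question; best_match, names_1, names_2 and results are hypothetical names introduced only for this illustration.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def best_match(name, candidates):
    """Return (position, ratio) of the candidate most similar to name."""
    ratios = [similar(name, c) for c in candidates]
    # Like np.argmax, max() returns the first position on ties
    best_pos = max(range(len(ratios)), key=ratios.__getitem__)
    return best_pos, ratios[best_pos]

# Sample rows from the question
names_1 = ["Bla bla bla June 04 2018", "Puppy Dog June 01 2017",
           "Donald Duck February 24 2017", "Bruno Venus April 24 2019"]
names_2 = ["Pluto Feb 09 2019", "Donald Glover Feb 22 2020",
           "Dog Feb 22 2020", "Bla Bla Feb 22 2020"]

results = [(name, *best_match(name, names_2)) for name in names_1]
for name, pos, ratio in results:
    print(f"{name!r} -> {names_2[pos]!r} (ratio {ratio:.2f})")
```

Printing the ratio next to each pair makes it easier to spot rows where even the best match is weak.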