Join/merge/concat 2 dataframes in pandas where the key columns are inconsistently spelled

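I have two dataframes that share a single key column: the person's full name. The spelling in the two columns is inconsistent. For example, a name may be missing a letter, carry a prefix such as Mr. (which the other df does not have), contain extra spaces, and so on. I have double-checked that these columns are object type (strings) in both dataframes. I want to merge the two dataframes.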

The code

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df1["BestMatch"] = df1["FULL_NAME"].map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])

gives me the error

TypeError: expected string or bytes-like object

I also tried

import difflib

df1['FULL_NAME'] = df1['FULL_NAME'].apply(lambda x: difflib.get_close_matches(x,df2['FULL_NAME'])[0])

which gives me the error

IndexError: list index out of range

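I have searched for how to fix these errors and the code, but nothing I found gave me a working answer. I am relatively inexperienced and I suspect I am missing something, but I am not sure what.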

You have NaN in the FULL_NAME column of df1.

Reproducible error:

import numpy as np
import pandas as pd
from fuzzywuzzy import process

df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre']})
df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre']})
>>> df1["FULL_NAME"].map(lambda x: process.extractOne(x, df2["FULL_NAME"])[0])
...
TypeError: expected string or bytes-like object

How to avoid this: call dropna before apply/map.

df1['BestMatch'] = df1["FULL_NAME"].dropna().map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])
>>> df1
   FULL_NAME     BestMatch
0      Louis      Mr Louis
1        NaN           NaN
2  Alexandre  Mr Alexandre
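
The IndexError from the difflib attempt looks like a separate problem: difflib.get_close_matches returns an empty list when no candidate reaches its cutoff (0.6 by default), so indexing the result with [0] fails. A minimal sketch of a guard, assuming you are happy to get None for names without a close match; closest_or_none is a hypothetical helper name, not part of difflib:

```python
import difflib

def closest_or_none(name, candidates, cutoff=0.6):
    """Return the closest candidate, or None if the input is not a string
    or nothing reaches the cutoff."""
    if not isinstance(name, str):      # NaN is a float, so it is skipped here
        return None
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

names = ["Mr Louis", "Mr Paul", "Mr Alexandre"]
print(closest_or_none("Louis", names))
print(closest_or_none("zzz", names))
print(closest_or_none(float("nan"), names))
```

With pandas, something like df1['FULL_NAME'].map(lambda x: closest_or_none(x, df2['FULL_NAME'].dropna().tolist())) should then leave unmatched rows empty instead of raising.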

I wrote a similar answer on how to use fuzzywuzzy to merge two dataframes:

Debug

Try to debug manually:

for x in df1['FULL_NAME']:
    try:
        b = process.extractOne(x, df2["FULL_NAME"])[0]
        print(f"{x} <-> {b}")
    except TypeError:
        print(f"XXX Problem with '{x}'")

Output:

Louis <-> Mr Louis
XXX Problem with 'nan'
Alexandre <-> Mr Alexandre
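
If you would rather not loop, the same check can be sketched with a boolean mask. This assumes the offending values are simply non-strings (NaN is a float); not_str and bad_rows are names made up for the example:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre']})

# True where the value is not a Python str, e.g. NaN
not_str = ~df1['FULL_NAME'].map(lambda v: isinstance(v, str))

# Rows that would make process.extractOne raise TypeError
bad_rows = df1[not_str]
print(bad_rows)
```

Any row that shows up in bad_rows needs to be dropped or filled before the fuzzy match is applied.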

Merge

Use the index [2] instead of the best-match name [0]:

df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre'], 
                    'DATA1': [10, 20, 30]})

df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre'],
                    'DATA2': [11, 21, 31]})

def best_match_index(name):
    # With a Series as choices, extractOne returns (match, score, index); keep the index
    return process.extractOne(name, df2["FULL_NAME"])[2]

df1['BestMatch'] = df1["FULL_NAME"].dropna().map(best_match_index)

out = df2.merge(df1, left_index=True, right_on='BestMatch', how='left')

Output:

>>> out
      FULL_NAME_x  DATA2 FULL_NAME_y  DATA1  BestMatch
0.0      Mr Louis     11       Louis   10.0          0
NaN       Mr Paul     21         NaN    NaN          1
2.0  Mr Alexandre     31   Alexandre   30.0          2
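
Since the question mentions prefixes like Mr. and extra spaces, it may also help to normalize the names before matching. A rough sketch using only the standard library; normalize_name and the list of honorifics are assumptions for illustration, so adapt them to your data:

```python
import re

# Hypothetical list of honorifics to strip; extend it to match your data.
PREFIX = re.compile(r"^(mr|mrs|ms|dr)\.?\s+", flags=re.IGNORECASE)

def normalize_name(name):
    """Lowercase the name, drop a leading honorific and extra spaces.
    Non-strings (such as NaN) are returned unchanged."""
    if not isinstance(name, str):
        return name
    collapsed = " ".join(name.split())   # trims and collapses runs of whitespace
    return PREFIX.sub("", collapsed).lower()

print(normalize_name("  Mr.  Louis "))
print(normalize_name("Mr Alexandre"))
```

You could apply it with df["FULL_NAME"].map(normalize_name) on both dataframes into a helper column and run the fuzzy match on that column, keeping the original names for display.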