Fuzzywuzzy

Question

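My goal is to match address information across two separate DataFrames. One DataFrame contains unique values, while the other does not. I want to take the unique key from df1 and copy it to df2 based on how similar the addresses are under fuzzy matching.

Here is an example: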

df1 = 

index  address_df1           unique key     call #      Name        Sales amount
                      (value I want to copy)

1    123 nice road           Uniquekey1      11       jim bob             8
2    150  spring drive       Uniquekey2      151      jane doe            8213
3    240 happy lane          Uniquekey3      71       michael scott       909
4    80 sad parkway          Uniquekey4      1586     tracey jackson      109
5    122 big lane            Uniquekey5      161      lulu buzz           99
6    315 small pk            Uniquekey6      586      tinker bell         11
7    13  round rd ste 10     Uniquekey7      8601     jack ryan          681
8    97  square rd           Uniquekey8      66       peter paul         61968

df2 (*note address column in different place) =

index   cost center        country         address_df2  

1         1111              us              123 nice rd 
2         1111              us              97  square rd 
3         1112              us              13  round rd
4         1112              us              150  spring dr

I would like the final DataFrame to look like this:

RESULT 
df3 (with unique key) =

index   cost center(df2)      country(df2)      address_df2         **unique key(from df1)**   fuzzy match %

1         1111                  us              123 nice rd            Uniquekey1               90%
2         1111                  us              97  square rd          Uniquekey8               90%
3         1112                  us              13  round rd           Uniquekey7               90%
4         1112                  us              150  spring dr         Uniquekey2               90%

I tried:

import pandas as pd
from fuzzywuzzy import process

THRESHOLD = 90

best_match = \
    df2['address_df2'].apply(lambda x: process.extractOne(x, df1['address_df1'],
                                                      score_cutoff=THRESHOLD))

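I was able to find matches with this code, which is great! However, when I merge the two DataFrames, the matched addresses do not line up. I think the data from df1 is not being sorted or matched by address.

I have also tried the code below to join the two DataFrames, but again the addresses do not align, so the unique IDs end up on the wrong rows.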


df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
               left_index=True, right_index=True, how='left')

Answer 1

Try this:

# best_match comes from the question's code; column 2 (the df1 index label) is dropped
df3 = pd.concat(
    [df2, best_match.apply(pd.Series).drop(2, axis=1)],
    axis=1,
).rename({0: 'unique key', 1: 'fuzzy match %'}, axis=1)

Output:

>>> df3
   index  cost-center country    address_df2          unique key  fuzzy match %
0      1         1111      us    123-nice-rd       123-nice-road             92
1      2         1111      us   97-square-rd        97-square-rd            100
2      3         1112      us    13-round-rd  13-round-rd-ste-10             90
3      4         1112      us  150-spring-dr    150-spring-drive             90
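Note that in the output above, the column labelled `unique key` holds the matched address from df1 rather than the key itself. One possible way to translate it is an address-to-key lookup. This is only a sketch: the frames below are hand-typed stand-ins (space-separated addresses as in the question, trimmed columns), `key_by_address` is a made-up name, and it assumes addresses in df1 are unique.

```python
import pandas as pd

# Stand-in for df1, trimmed to the two columns that matter here.
df1 = pd.DataFrame({
    "address_df1": ["123 nice road", "150 spring drive",
                    "13 round rd ste 10", "97 square rd"],
    "unique key": ["Uniquekey1", "Uniquekey2", "Uniquekey7", "Uniquekey8"],
})

# Stand-in for df3 as produced above: "unique key" really holds addresses.
df3 = pd.DataFrame({
    "address_df2": ["123 nice rd", "97 square rd", "13 round rd", "150 spring dr"],
    "unique key": ["123 nice road", "97 square rd",
                   "13 round rd ste 10", "150 spring drive"],
    "fuzzy match %": [92, 100, 90, 90],
})

# Address -> key lookup; only valid while df1 addresses are unique.
key_by_address = df1.set_index("address_df1")["unique key"]

# Give the address column an honest name, then map it to the real key.
df3 = df3.rename(columns={"unique key": "matched address"})
df3["unique key"] = df3["matched address"].map(key_by_address)
print(df3)
```

If df1 can contain duplicate addresses, a lookup by address is ambiguous, and aligning on the index label returned by `extractOne` is the safer route.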

Answer 2

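When the choices are passed as a pandas Series, the third item of the tuple returned by extractOne is the index label of the best-matching row in df1. You can therefore use loc to select the unique key column from df1.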

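# Prefer the thefuzz package (fuzzywuzzy under its new name)
import pandas as pd
from thefuzz import process

THRESHOLD = 90

def best_match(address):
    # With a Series as choices, extractOne returns (match, score, index label)
    return process.extractOne(address, df1['address_df1'])

# Expand the tuples into columns 0 (match), 1 (score), 2 (df1 index label)
match = df2['address_df2'].apply(best_match).apply(pd.Series)

# Look the key up by label, then blank out rows scoring below the threshold
df2['unique key'] = df1.loc[match[2], 'unique key'] \
                       .mask(match[1].lt(THRESHOLD).values) \
                       .values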

Output:

>>> df2
   cost center country     address_df2  unique key
1         1111      us     123 nice rd  Uniquekey1
2         1111      us    97 square rd  Uniquekey8
3         1112      us     13 round rd  Uniquekey7
4         1112      us   150 spring dr  Uniquekey2
5         1113      fr  26 chemin vert         NaN  # for testing

>>> match
                    0    1  2
1       123 nice road   92  1
2        97 square rd  100  8
3  13 round rd ste 10   90  7
4    150 spring drive   90  2
5       123 nice road   44  1  # for testing

Fuzzywuzzy - copy info associated with a row from one df to another using a match

python

dataframe

pandas