Fuzzy-compare two dataframes of addresses and copy info from one to another

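I have this dataset: df1 has 70,000 rows and df2 has ~30 rows. I want to match addresses to see whether each df2 address appears in df1 and, if it does, show the match and pull the information from df1 into a new df3. Sometimes the address information is slightly off, for example (road = rd, street = st, etc). Here is an example: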

df1 = 

address                unique key (and more columns)

123 nice road           Uniquekey1
150  spring drive       Uniquekey2
240 happy lane          Uniquekey3
80 sad parkway          Uniquekey4
etc


df2 =

address            (and more columns)

123 nice rd          
150  spring dr      
240 happy lane          
80 sad parkway         
etc

This is the new dataframe I want:

df3=

address(from df2)     address matched(from df1)       unique key(comes from df1) (and more columns)

123 nice rd            123 nice road                    Uniquekey1
150  spring dr         150  spring drive                Uniquekey2
240 happy lane         240 happy lane                   Uniquekey3
 80 sad parkway        80 sad parkway                   Uniquekey4
etc            

Here is what I have tried so far using difflib:

import difflib

df1['key'] = df1['address']
df2['key'] = df2['address']

df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))

This returns what looks like a list (the answer is wrapped in []), so I then convert df2['key'] into a string using df2['key'] = df2['key'].apply(str)

Then I try to merge using df2.merge(df1, on='key'), and no address matches?

I'm not sure what the problem could be, so any help would be appreciated. I have also been playing with the fuzzywuzzy package.
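
A likely reason the merge finds nothing: get_close_matches returns a list, so apply(str) turns it into the literal text "['123 nice road']", brackets and quotes included, which never equals the plain string in df1['key']. Below is a minimal sketch of one way around it, taking the first element of the list instead. The helper name closest and the small frames are made up for illustration, and difflib's default cutoff of 0.6 is assumed to be acceptable.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({
    "address": ["123 nice road", "150  spring drive", "240 happy lane", "80 sad parkway"],
    "unique key": ["Uniquekey1", "Uniquekey2", "Uniquekey3", "Uniquekey4"],
})
df2 = pd.DataFrame({
    "address": ["123 nice rd", "150  spring dr", "240 happy lane", "80 sad parkway"],
})

candidates = df1["address"].tolist()


def closest(address):
    """Return the closest df1 address as a plain string, or None if nothing is close."""
    matches = difflib.get_close_matches(address, candidates, n=1)
    return matches[0] if matches else None


# str() on the returned list keeps the brackets and quotes,
# so the result cannot equal any df1 address.
assert str(["123 nice road"]) == "['123 nice road']"

df2["key"] = df2["address"].apply(closest)

# Left merge keeps every df2 row, matched or not.
df3 = df2.merge(df1, left_on="key", right_on="address", how="left",
                suffixes=("_df2", "_df1"))
print(df3[["address_df2", "address_df1", "unique key"]])
```

With how="left", any df2 address without a close match stays in df3 with NaN in the df1 columns.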

My answer is similar to the one I gave to your earlier question.

I modified your dataframes slightly:

>>> df1
             address  unique key
0      123 nice road  Uniquekey1
1  150  spring drive  Uniquekey2
2     240 happy lane  Uniquekey3
3     80 sad parkway  Uniquekey4

>>> df2  # shuffle rows
          address
0  80 sad parkway
1  240 happy lane
2  150  winter dr  # change the season :-)
3     123 nice rd

Use the extractOne function from fuzzywuzzy.process:

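import pandas as pd
from fuzzywuzzy import process

# minimum score (0-100) a candidate must reach to count as a match
THRESHOLD = 90

# for each df2 address, find the best-scoring df1 address (None if below THRESHOLD)
best_match = df2['address'].apply(
    lambda x: process.extractOne(x, df1['address'], score_cutoff=THRESHOLD)
)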

The output of extractOne is:

>>> best_match
0    (80 sad parkway, 100, 3)
1    (240 happy lane, 100, 2)
2                        None
3      (123 nice road, 92, 0)
Name: address, dtype: object
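
Because df1['address'] is a Series, each result is a (matched address, score, df1 index label) tuple, or None when nothing reaches the cutoff. Here is a small sketch of unpacking those tuples into named columns. It copies the best_match values shown above as literal data so that it runs without fuzzywuzzy, and the column names are made up for illustration.

```python
import pandas as pd

# The best_match values shown above, typed in by hand for illustration.
best_match = pd.Series(
    [("80 sad parkway", 100, 3), ("240 happy lane", 100, 2), None, ("123 nice road", 92, 0)],
    name="address",
)

# Keep only the rows that found a match, then split each tuple into three columns.
found = best_match.dropna()
unpacked = pd.DataFrame(
    found.tolist(),
    index=found.index,
    columns=["matched_address", "score", "df1_index"],
)

# Reindex back to every df2 row; rows without a match become NaN.
unpacked = unpacked.reindex(best_match.index)
print(unpacked)
```

Having the score in its own column makes it easy to review borderline matches before trusting them.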

Now you can merge your two dataframes:

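# df1 row label of the best match for each matched df2 row (unmatched rows are dropped)
match_idx = best_match.dropna().map(lambda m: m[2])

# pick those df1 rows, relabel them with the df2 row they belong to, then left-join
matched = df1.loc[match_idx.tolist()].set_index(match_idx.index)
df3 = df2.join(matched, lsuffix='_x', rsuffix='_y')
>>> df3
        address_x       address_y  unique key
0  80 sad parkway  80 sad parkway  Uniquekey4
1  240 happy lane  240 happy lane  Uniquekey3
2  150  winter dr             NaN         NaN
3     123 nice rd   123 nice road  Uniquekey1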

This answer is longer, but I'll post it anyway because it may be easier to follow: you can see each step as it happens.

Set up the frames:

import pandas as pd

#pip install fuzzywuzzy
#pip install python-Levenshtein
from fuzzywuzzy import fuzz, process

# matching threshold (0-100). higher is stricter: fewer false matches, but more genuine matches get missed. tune as required (e.g. somewhere between 45 and 95)
threshold = 75

df1 = pd.DataFrame({'address': {0: '123 nice road',
  1: '150  spring drive',
  2: '240 happy lane',
  3: '80 sad parkway'},
 'unique key (and more columns)': {0: 'Uniquekey1',
  1: 'Uniquekey2',
  2: 'Uniquekey3',
  3: 'Uniquekey4'}})

df2 = pd.DataFrame({'address': {0: '123 nice rd',
  1: '150  spring dr',
  2: '240 happy lane',
  3: '80 sad parkway'},
 'unique key (and more columns)': {0: 'Uniquekey1',
  1: 'Uniquekey2',
  2: 'Uniquekey3',
  3: 'Uniquekey4'}})

Then the main code:

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if score > min_score and score > max_score:
            max_add = x
            max_score = score
    return (max_add, max_score)

# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff = threshold)
    if o is not None:
        return o[1]
    
# creating two lists from address column of both dataframes
df1_addresses = list(df1.address.unique())
df2_addresses = list(df2.address.unique())

# via fuzzywuzzy matching and using match_addresses() above
# return a dictionary of addresses where there is a match
names = []
for x in df1_addresses:
    match = match_addresses(x, df2_addresses, threshold)
    if match[1] >= threshold:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)


# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['df1_address', 'df2_address'])

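# create new frame: attach each match to its df1 row by address, not by row position
df3 = df1.merge(match_df, left_on='address', right_on='df1_address', how='left')
del df3['df1_address']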

# shuffle the matched address column to be next to the original address of df1
c = df3.columns.tolist()
c.insert(1, c.pop(c.index('df2_address')))
df3 = df3.reindex(columns=c)

# add fuzzywuzzy scoring as a new column
df3['fuzzywuzzy_score'] = df3.apply(lambda x: scoringMatches(x['address'], df2['address']), axis=1)

print(df3)

Output:

    address             df2_address     unique key (and more columns)   fuzzywuzzy_score
0   123 nice road       123 nice rd     Uniquekey1                      92
1   150 spring drive    150 spring dr   Uniquekey2                      90
2   240 happy lane      240 happy lane  Uniquekey3                      100
3   80 sad parkway      80 sad parkway  Uniquekey4                      100
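
Both answers rely on string similarity alone. Since the question says the differences are mostly predictable abbreviations (road = rd, street = st), one optional preprocessing step is to normalize both address columns before matching, so that many pairs become exact matches. A rough sketch, assuming a hand-written abbreviation table (ABBREVIATIONS and normalize_address are hypothetical names, not part of pandas or fuzzywuzzy):

```python
import re

# Hypothetical abbreviation table; extend it to whatever actually appears in the data.
ABBREVIATIONS = {
    "rd": "road",
    "st": "street",
    "dr": "drive",
    "ln": "lane",
    "pkwy": "parkway",
}


def normalize_address(address):
    """Lower-case, collapse repeated whitespace, and expand known abbreviations."""
    words = re.sub(r"\s+", " ", address.strip().lower()).split(" ")
    # Strip a trailing period only for the lookup, so "Dr." is found as "dr".
    return " ".join(ABBREVIATIONS.get(word.rstrip("."), word) for word in words)


print(normalize_address("123 nice rd"))      # 123 nice road
print(normalize_address("150  Spring Dr."))  # 150 spring drive
```

It could be applied with df1['address'].map(normalize_address), and the same for df2, before either approach above. An abbreviation such as st can also mean "saint", so the table needs checking against the real data.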