Fuzzy-compare two dataframes of addresses and copy info from one to another
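I have this dataset: df1 has 70,000 rows and df2 has ~30 rows. I want to match addresses to see whether the addresses in df2 appear in df1, and if they do, I want to show the match and pull information from df1 to create a new df3. Sometimes the address information is slightly off, for example (road = rd, street = st, etc.). Here is an example: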
df1 =
address unique key (and more columns)
123 nice road Uniquekey1
150 spring drive Uniquekey2
240 happy lane Uniquekey3
80 sad parkway Uniquekey4
etc
df2 =
address (and more columns)
123 nice rd
150 spring dr
240 happy lane
80 sad parkway
etc
This is the new dataframe I want:
df3=
address (from df2) address matched (from df1) unique key (comes from df1) (and more columns)
123 nice rd 123 nice road Uniquekey1
150 spring dr 150 spring drive Uniquekey2
240 happy lane 240 happy lane Uniquekey3
80 sad parkway 80 sad parkway Uniquekey4
etc
Here is what I have tried so far using difflib:
df1['key'] = df1['address']
df2['key'] = df2['address']
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))
This returns what looks like a list (the answer is wrapped in []), so I then convert df2['key'] to a string using df2['key'] = df2['key'].apply(str).
Then I try to merge using df2.merge(df1, on='key'), and no address matches.
I'm not sure what the problem could be; any help would be appreciated. I am also playing with the fuzzywuzzy package.
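A minimal sketch of why the merge in the attempt above finds nothing, using plain strings instead of dataframes. The sample addresses are the ones from the question; `first_or_none` is a hypothetical helper name. `difflib.get_close_matches` returns a list, and `str()` of a list keeps the brackets and quotes, so the resulting key no longer equals any address in df1.

```python
import difflib

df1_addresses = ["123 nice road", "150 spring drive", "240 happy lane", "80 sad parkway"]

# get_close_matches always returns a list, even with n=1
matches = difflib.get_close_matches("123 nice rd", df1_addresses, n=1)

# str() on a list keeps the brackets and quotes, so this can never equal an address
as_string = str(matches)


def first_or_none(candidates):
    """Return the single best match, or None when the list is empty."""
    return candidates[0] if candidates else None


key = first_or_none(matches)

print(repr(as_string))
print(repr(key))
```

With dataframes, the same idea would presumably be `df2['key'].apply(lambda m: m[0] if m else None)` in place of `.apply(str)`, after which the merge on 'key' has real addresses to compare.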
My answer is similar to the one I gave to your old question.
I slightly modified your dataframes:
>>> df1
address unique key
0 123 nice road Uniquekey1
1 150 spring drive Uniquekey2
2 240 happy lane Uniquekey3
3 80 sad parkway Uniquekey4
>>> df2 # shuffle rows
address
0 80 sad parkway
1 240 happy lane
2 150 winter dr # change the season :-)
3 123 nice rd
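Use the extractOne function from fuzzywuzzy.process:
import pandas as pd
from fuzzywuzzy import process
THRESHOLD = 90
# for each df2 address, find the best df1 address scoring at least THRESHOLD
best_match = df2['address'].apply(
    lambda x: process.extractOne(x, df1['address'], score_cutoff=THRESHOLD))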
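The output of extractOne is a tuple (matched address, score, index in df1), or None when nothing reaches the threshold: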
>>> best_match
0 (80 sad parkway, 100, 3)
1 (240 happy lane, 100, 2)
2 None
3 (123 nice road, 92, 0)
Name: address, dtype: object
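To make the shape of these tuples concrete, here is a rough stand-in written with the standard library's difflib instead of fuzzywuzzy. `extract_one` is a hypothetical helper, and its scores come from `SequenceMatcher.ratio`, so they will not necessarily equal the fuzzywuzzy scores shown above. Only the (match, score, key) layout and the `None` for results under the cutoff are meant to carry over.

```python
from difflib import SequenceMatcher


def extract_one(query, choices, score_cutoff=0):
    """Return (best_choice, score, key), or None if no score reaches score_cutoff.

    `choices` is a dict mapping key -> string, playing the role of df1['address'].
    """
    best = None
    for key, choice in choices.items():
        # similarity on a 0-100 scale (difflib stand-in, not the fuzzywuzzy scorer)
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score, key)
    return best


df1_address = {0: "123 nice road", 1: "150 spring drive",
               2: "240 happy lane", 3: "80 sad parkway"}

exact = extract_one("80 sad parkway", df1_address, score_cutoff=90)
abbreviated = extract_one("123 nice rd", df1_address, score_cutoff=90)
missing = extract_one("150 winter dr", df1_address, score_cutoff=90)

print(exact, abbreviated, missing)
```

The third element of each tuple is the key of the matching entry, which is what lets the match be traced back to a row of df1.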
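Now you can merge your 2 dataframes. Look up the matched df1 rows by their index and relabel them with the index of the df2 rows they belong to:
matches = best_match.dropna()  # df2 rows that found a match
idx = matches.apply(lambda m: m[2])  # index of the matching row in df1
matched = df1.loc[idx].set_index(idx.index)  # matched df1 rows, labelled like df2
df3 = pd.merge(df2, matched, left_index=True, right_index=True, how='left')
>>> df3
address_x address_y unique key
0 80 sad parkway 80 sad parkway Uniquekey4
1 240 happy lane 240 happy lane Uniquekey3
2 150 winter dr NaN NaN
3 123 nice rd 123 nice road Uniquekey1
Row 2 has no match above the threshold, so its df1 columns are NaN.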
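It can also help to keep the score next to each match, so that weak or missing matches are easy to review by hand. A small sketch, assuming the match results have the (address, score, df1 index) layout shown earlier; the tuples are hard-coded here so the example runs without fuzzywuzzy, and the names `scored` and `needs_review` are illustrative only.

```python
import pandas as pd

df2 = pd.DataFrame({"address": ["80 sad parkway", "240 happy lane",
                                "150 winter dr", "123 nice rd"]})

# Hard-coded copy of the match results: (match, score, df1 index) or None
best_match = pd.Series([("80 sad parkway", 100, 3), ("240 happy lane", 100, 2),
                        None, ("123 nice road", 92, 0)])

scored = df2.copy()
# Unpack each tuple into its own column; rows without a match get None
scored["matched_address"] = [m[0] if isinstance(m, tuple) else None for m in best_match]
scored["score"] = [m[1] if isinstance(m, tuple) else None for m in best_match]

# Rows with no match above the cutoff, to be checked manually
needs_review = scored[scored["matched_address"].isna()]

print(scored)
print(needs_review["address"].tolist())
```

With only ~30 rows in df2, looking through `needs_review` by eye is probably cheaper than tuning the threshold until every address matches.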
This answer is rather long, but I'm posting it because you can follow it more easily when you can see each step as it happens.
Set up the frames:
import pandas as pd
#pip install fuzzywuzzy
#pip install python-Levenshtein
from fuzzywuzzy import fuzz, process
# matching threshold; may need tuning (anywhere from 45 to 95). Higher is stricter, so fewer addresses get matched. Adjust as required
threshold = 75
df1 = pd.DataFrame({'address': {0: '123 nice road',
1: '150 spring drive',
2: '240 happy lane',
3: '80 sad parkway'},
'unique key (and more columns)': {0: 'Uniquekey1',
1: 'Uniquekey2',
2: 'Uniquekey3',
3: 'Uniquekey4'}})
df2 = pd.DataFrame({'address': {0: '123 nice rd',
1: '150 spring dr',
2: '240 happy lane',
3: '80 sad parkway'},
'unique key (and more columns)': {0: 'Uniquekey1',
1: 'Uniquekey2',
2: 'Uniquekey3',
3: 'Uniquekey4'}})
Then the main code:
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if score > min_score and score > max_score:
            max_add = x
            max_score = score
    return (max_add, max_score)
# return the fuzzywuzzy score (None when nothing reaches the threshold)
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=threshold)
    if o is not None:
        return o[1]
# creating two lists from address column of both dataframes
df1_addresses = list(df1.address.unique())
df2_addresses = list(df2.address.unique())
# via fuzzywuzzy matching and using match_addresses() above
# return a dictionary of addresses where there is a match
names = []
for x in df1_addresses:
    match = match_addresses(x, df2_addresses, threshold)
    if match[1] >= threshold:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(list(name_dict.items()), columns=['df1_address', 'df2_address'])
# create new frame; merge on the address so that unmatched df1 rows do not shift the alignment
df3 = df1.merge(match_df, left_on='address', right_on='df1_address', how='left')
del df3['df1_address']
# shuffle the matched address column to be next to the original address of df1
c = df3.columns.tolist()
c.insert(1, c.pop(c.index('df2_address')))
df3 = df3.reindex(columns=c)
# add fuzzywuzzy scoring as a new column
df3['fuzzywuzzy_score'] = df3.apply(lambda x: scoringMatches(x['address'], df2['address']), axis=1)
print(df3)
Output:
address df2_address unique key (and more columns) fuzzywuzzy_score
0 123 nice road 123 nice rd Uniquekey1 92
1 150 spring drive 150 spring dr Uniquekey2 90
2 240 happy lane 240 happy lane Uniquekey3 100
3 80 sad parkway 80 sad parkway Uniquekey4 100
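Since most of the mismatches described in the question are street-type abbreviations, one option is to expand them before scoring, so that fewer pairs depend on the threshold at all. A minimal sketch, assuming a small hand-written abbreviation table; `ABBREVIATIONS` and `normalize_address` are illustrative names, and the table is far from complete.

```python
import re

# Hypothetical, incomplete lookup table; extend it for the data at hand
ABBREVIATIONS = {
    "rd": "road",
    "st": "street",
    "dr": "drive",
    "ln": "lane",
    "pkwy": "parkway",
    "ave": "avenue",
}


def normalize_address(address):
    """Lower-case, replace punctuation with spaces, and expand known abbreviations."""
    words = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(normalize_address("123 Nice Rd."))
print(normalize_address("150 spring dr"))
```

The normalized strings could be stored in a helper column of both frames (for example via `df1['address'].map(normalize_address)`) and used for matching, while the original addresses stay available for display. Ambiguous abbreviations such as "st" (street or saint) would need more care than this table gives them.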