Pandas: Approximate join on one column, exact match on other columns
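I have two pandas DataFrames that I want to join/merge exactly on several columns (say 3) and approximately, i.e. by nearest neighbour, on one (date) column. I also want to return the difference (in days) between the matched dates. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the "leftovers" would also be interesting if they are not too hard to get hold of. Most of the "exact match" observations will occur multiple times in each data frame.
I have been trying to use difflib.get_close_matches on all the strings concatenated together (I know this is silly!), but it does not always give exact matches. I guess I need to loop through the exact matches first and then find the closest match within each group, but I just can't seem to get it right...
The data frames look something like this: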
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
df1
Out[430]:
       col1   col2 col3        date
index
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21
df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')
df2
Out[434]:
       col1   col2 col3        date
index
b1      132    asd    1  2010-01-23
b2      432  dsa12    2  2016-05-23
b3      432  dsa12    2  2010-06-10
b4      123    sd2    3  2008-10-21
What I want to end up with is:
       col1   col2 col3        date  diff match_index
index
a1     1232    asd    1  2010-01-23   nan         nan
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
a4      123   asd2    3  2008-10-21   nan         nan
a5      123    sd2    3  2008-10-21   nan          b4
Or, if it is easier to do just the inner join, I would like:
       col1   col2 col3        date  diff match_index
index
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
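I'm not sure whether this fits. It more or less achieves what you want, but it does not actually perform a merge. It follows the same idea as a related answer, except that instead of subsetting df1 on a single column, here we use groupby to match on multiple columns, and we do it on both data frames. If you do want to include an explicit merge command and are happy with an inner join, check the very bottom of the answer, which contains a snippet.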
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def as_days(dates):
    # NearestNeighbors needs numeric input: whole days since the epoch, as a column vector
    return dates.values.astype('datetime64[D]').astype('int64')[:, None]

def find_nearest(group, df2, groupname):
    # group.name holds the key values (col1, col2, col3) of the current group
    try:
        match = df2.groupby(groupname).get_group(group.name).copy()
    except KeyError:
        # no rows in df2 share these keys: return the group unchanged
        return group
    group = group.copy()
    match['date'] = pd.to_datetime(match['date'])
    nbrs = NearestNeighbors(n_neighbors=1).fit(as_days(match['date']))
    dist, ind = nbrs.kneighbors(as_days(group['date']))
    group['date1'] = group['date']                      # keep the original date
    group['date'] = match['date'].values[ind.ravel()]   # date of the nearest match
    group['diff'] = group['date1'] - group['date']
    group['match_index'] = match.index[ind.ravel()]
    return group

# change dates from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# find closest dates and differences
# (group_keys=False keeps the original row index in the result)
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys, group_keys=False).apply(find_nearest, df2, keys)
# fill unmatched dates
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod = df2.groupby(keys, group_keys=False).apply(find_nearest, df1, keys)
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])

# drop original column
df1_mod = df1_mod.drop('date', axis=1).rename(columns={'date1': 'date'})
df2_mod = df2_mod.drop('date', axis=1).rename(columns={'date1': 'date'})
df2_mod['diff'] = -df2_mod['diff']

# drop redundant values (rows of df2 that already appear as a match in df1_mod)
df2_mod.drop(df2_mod[df2_mod.match_index.str.len() > 0].index, inplace=True)

# merge the two
df_final = pd.merge(df1_mod, df2_mod, how='outer')
This yields the following:
In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
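The diff column in that output holds Timedelta values, while the question asks for a number of days. A minimal sketch of the conversion, where the series named diff is a hypothetical stand-in for df_final['diff'] and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_final['diff']: one missing value, -3 days, 10 days.
diff = pd.to_timedelta(pd.Series([np.nan, -3, 10]), unit='D')

# .dt.days extracts the whole-day component; missing values stay missing.
days = diff.dt.days
print(days)
```

Because of the missing values the result is a float column. If integers are needed, dropping or filling the unmatched rows first is one option.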
Using the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3
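For the inner-join case, a pandas-only sketch is also possible: merge exactly on the key columns, then keep the candidate with the smallest absolute date gap for each left-hand row. This is a different technique from the NearestNeighbors approach above, and the names left, right, pairs, best and inner are assumptions made for illustration:

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']
left = pd.DataFrame({
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
}, index=pd.Index(['a1', 'a2', 'a3', 'a4'], name='index'))
right = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
}, index=pd.Index(['b1', 'b2', 'b3', 'b4'], name='index'))

# Pair every left row with every right row that has identical keys.
pairs = left.reset_index().merge(
    right.reset_index(), on=keys, suffixes=('', '_right'))
pairs['diff'] = (pairs['date'] - pairs['date_right']).dt.days

# For each left row, keep the pairing with the smallest absolute gap.
best = pairs.loc[pairs['diff'].abs().groupby(pairs['index']).idxmin()]
inner = (best.rename(columns={'index_right': 'match_index'})
             .set_index('index')[keys + ['date', 'diff', 'match_index']])
print(inner)
```

One caveat: the intermediate pairs table has one row per same-key combination, so with 50,000 rows and keys that repeat many times it can become large.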
The case considered in the comments, i.e.:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
yields the following:
In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
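If your pandas version provides pd.merge_asof, the exact-plus-nearest match may be expressible more directly with its by and direction='nearest' arguments. This is not what the answer above does; it is a sketch of an alternative, and the names left, right, match_date and result are assumptions made for illustration:

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']
left = pd.DataFrame({
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
}, index=pd.Index(['a1', 'a2', 'a3', 'a4'], name='index'))
right = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
}, index=pd.Index(['b1', 'b2', 'b3', 'b4'], name='index'))

# merge_asof requires both frames to be sorted by the column used for the nearest match.
left_sorted = left.reset_index().sort_values('date')
right_sorted = (right.reset_index()
                     .rename(columns={'index': 'match_index', 'date': 'match_date'})
                     .sort_values('match_date'))

# Exact match on the key columns (by=...), nearest match on the dates.
result = pd.merge_asof(left_sorted, right_sorted,
                       left_on='date', right_on='match_date',
                       by=keys, direction='nearest')
result['diff'] = (result['date'] - result['match_date']).dt.days
result = result.set_index('index').sort_index()
print(result[keys + ['date', 'diff', 'match_index']])
```

merge_asof behaves like a left join, so unmatched left rows keep NaN in diff and match_index; calling dropna on match_index would give the inner-join result. The tolerance argument (for example tolerance=pd.Timedelta('30D')) can be used to reject matches that are too far apart.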