Pandas: Outer join with a specified range of difference between the keys
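I want to perform an outer join on two DataFrames whose keys are id: int and date: pd.Timestamp. The twist is that two keys should be treated as equal if the ids are the same (normal behaviour) and the dates are either equal (normal behaviour) or differ by at most 30 days. When the outer join is performed, date should be taken from the right DataFrame. An example: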
left = pd.DataFrame({"id": [1, 2, 3], "date": [pd.Timestamp(2002, 3, 25), pd.Timestamp(2003, 4, 4), pd.Timestamp(2004, 6, 6)], "val_3": [77, 88, 11]})
right = pd.DataFrame({"id": [1, 2, 3], "date": [pd.Timestamp(2002, 3, 10), pd.Timestamp(2003, 4, 27), pd.Timestamp(2004, 5, 5)], "val_1": [99, 66, 33], "val_2": [101, 102, 103]})
The result after the join should be:
result = pd.DataFrame({"id": [1, 2, 3, 3], "date": [pd.Timestamp(2002, 3, 10), pd.Timestamp(2003, 4, 27), pd.Timestamp(2004, 6, 6), pd.Timestamp(2004, 5, 5)], "val_3": [77, 88, 11, np.nan], "val_1": [99, 66, np.nan, 33], "val_2": [101, 102, np.nan, 103]})
Looking forward to your answers!
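I would merge on 'id' and then, wherever the two dates are not within 30 days of each other, split the merged row back into a separate left row and right row.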
import pandas as pd

# Rename so it's easier to split columns later
left = left.rename(columns={'date': 'date_l'})
m = left.merge(right, on='id', how='outer')

# True where the two dates differ by at most 30 days, in either direction
mask = (m.date - m.date_l).abs() <= pd.Timedelta(days=30)

pd.concat([
    m[mask].drop(columns='date_l'),                                  # matched rows keep the right-hand date
    m.loc[~mask, left.columns].rename(columns={'date_l': 'date'}),   # unmatched left rows
    m.loc[~mask, right.columns]],                                    # unmatched right rows
    ignore_index=True, sort=False)
Output:
   id  val_3       date  val_1  val_2
0   1   77.0 2002-03-10   99.0  101.0
1   2   88.0 2003-04-27   66.0  102.0
2   3   11.0 2004-06-06    NaN    NaN
3   3    NaN 2004-05-05   33.0  103.0
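If this is needed more than once, the same merge-then-split idea can be wrapped in a function. The following is only a sketch: the name outer_join_within and the tol parameter are invented here for illustration, and it assumes each id appears at most once in each frame (with repeated ids the merge produces every pairing, and unmatched rows would be duplicated). It also drops the empty half-rows that the split would otherwise create for ids present in only one frame.

```python
import pandas as pd


def outer_join_within(left, right, tol=pd.Timedelta(days=30)):
    """Outer-join on 'id', treating dates at most `tol` apart as equal.

    Hypothetical helper; assumes each id occurs at most once per frame.
    """
    lft = left.rename(columns={"date": "date_l"})
    m = lft.merge(right, on="id", how="outer")

    # Symmetric check; comparisons involving NaT evaluate to False.
    close = (m["date"] - m["date_l"]).abs() <= tol

    matched = m[close].drop(columns="date_l")  # keeps the right-hand date
    left_only = m.loc[~close, lft.columns].rename(columns={"date_l": "date"})
    right_only = m.loc[~close, right.columns]

    out = pd.concat([matched, left_only, right_only],
                    ignore_index=True, sort=False)
    # Ids present in only one frame yield a half-row with no date; drop it.
    return out.dropna(subset=["date"]).reset_index(drop=True)


left = pd.DataFrame({
    "id": [1, 2, 3],
    "date": [pd.Timestamp(2002, 3, 25), pd.Timestamp(2003, 4, 4),
             pd.Timestamp(2004, 6, 6)],
    "val_3": [77, 88, 11],
})
right = pd.DataFrame({
    "id": [1, 2, 3],
    "date": [pd.Timestamp(2002, 3, 10), pd.Timestamp(2003, 4, 27),
             pd.Timestamp(2004, 5, 5)],
    "val_1": [99, 66, 33],
    "val_2": [101, 102, 103],
})

result = outer_join_within(left, right)
print(result)
```

Unlike the snippet above, this version does not overwrite left, so the original frames stay usable afterwards.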