Merging with pandas: left_on with a date and right_on with the oldest time range containing this date
Let's use these two sample dataframes:
import pandas as pd
df1 = pd.DataFrame({'Id':['A','A','B','C'], 'Date':["2020-03-01","2021-04-21","2020-12-10","2017-01-01"]})
  Id        Date
0  A  2020-03-01
1  A  2021-04-21
2  B  2020-12-10
3  C  2017-01-01
df2 = pd.DataFrame({'Id':['A','A','B'], 'Start':["2020-01-01","2020-02-21","2019-12-10"],
                    'End':["2021-01-01","2021-02-21","2021-12-10"], "Value":[1,2,3]})
  Id       Start         End  Value
0  A  2020-01-01  2021-01-01      1
1  A  2020-02-21  2021-02-21      2
2  B  2019-12-10  2021-12-10      3
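I would like to add a Value column to df1. The value comes from the row of df2 with the same Id whose Start and End bracket the Date in df1. If several rows qualify, I want the value from the one with the oldest Start date.
I currently do this with a for loop, but it is very slow on my real, much larger dataframes. My intuition is that this can be done with a left join, but I don't know how. Do you have any idea?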
Expected output:
  Id        Date  Value
0  A  2020-03-01    1.0
1  A  2021-04-21    NaN
2  B  2020-12-10    3.0
3  C  2017-01-01    NaN
I recently answered a similar question.
Let's make sure that the dates are datetime and sort df2 by Start:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Start'] = pd.to_datetime(df2['Start'])
df2['End'] = pd.to_datetime(df2['End'])
df2.sort_values(by='Start', inplace=True)
Make the index of df2 an IntervalIndex:
df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
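To illustrate what closed='both' means here, this is a small standalone sketch using the two A ranges from the question. The variable names (idx, inside, on_edge) are made up for the illustration, and it uses IntervalIndex.contains rather than the .loc lookup from this answer.

```python
import pandas as pd

# The two "A" ranges from the question, as closed intervals [Start, End]
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-01-01", "2020-02-21"]),
    pd.to_datetime(["2021-01-01", "2021-02-21"]),
    closed="both",
)

# Element-wise test: which intervals contain a given date?
inside = idx.contains(pd.Timestamp("2020-02-01"))
print(inside)   # [ True False]

# With closed="both" the endpoints themselves count as inside
on_edge = idx.contains(pd.Timestamp("2021-02-21"))
print(on_edge)  # [False  True]
```

With closed='right' or closed='left', a date equal to one of the endpoints would no longer match, so the choice should follow how Start and End are meant to be read.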
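Write a custom function and apply it to the rows:
def get_date(s):
    try:
        # All df2 rows whose [Start, End] interval contains the date
        d = df2.loc[s['Date']]
        # df2 is sorted by Start, so the first row with this Id is the oldest
        return d[d['Id'] == s['Id']].iloc[0]['Value']
    except (KeyError, IndexError):
        # KeyError: no interval contains the date; IndexError: none for this Id
        return None
df1['Value'] = df1.apply(get_date, axis=1)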
Output:
  Id        Date  Value
0  A  2020-03-01    1.0
1  A  2021-04-21    NaN
2  B  2020-12-10    3.0
3  C  2017-01-01    NaN
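For comparison, here is a sketch of the same row-wise lookup written with an explicit boolean mask instead of the IntervalIndex, which makes the "no match" case explicit. The function name lookup_value and its ranges parameter are made up for this illustration. It is still a row-by-row apply, so it is not expected to be fast on large frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Id': ['A', 'A', 'B', 'C'],
    'Date': pd.to_datetime(["2020-03-01", "2021-04-21", "2020-12-10", "2017-01-01"]),
})
df2 = pd.DataFrame({
    'Id': ['A', 'A', 'B'],
    'Start': pd.to_datetime(["2020-01-01", "2020-02-21", "2019-12-10"]),
    'End': pd.to_datetime(["2021-01-01", "2021-02-21", "2021-12-10"]),
    'Value': [1, 2, 3],
})

def lookup_value(row, ranges):
    """Value of the oldest range with the same Id that contains row['Date']."""
    mask = (
        (ranges['Id'] == row['Id'])
        & (ranges['Start'] <= row['Date'])
        & (ranges['End'] >= row['Date'])
    )
    candidates = ranges.loc[mask]
    if candidates.empty:
        # No range for this Id contains the date
        return float('nan')
    # Oldest Start first, then take its Value
    return candidates.sort_values('Start')['Value'].iloc[0]

df1['Value'] = df1.apply(lookup_value, axis=1, ranges=df2)
print(df1)
```

On the sample data this should give the same Value column as the expected output in the question.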
Use .merge() + .between() + .drop_duplicates():
# Sort if not already in `Id`, `Start` order
#df2 = df2.sort_values(by=['Id', 'Start'])
df3 = df1.merge(df2, on='Id')
df3_filtered = df3.loc[df3['Date'].between(df3['Start'], df3['End'])]
df4 = df3_filtered.drop_duplicates(['Id', 'Date'], keep='first')
df_out = df1.merge(df4[['Id', 'Date', 'Value']], how='left')
Result:
print(df_out)
  Id        Date  Value
0  A  2020-03-01    1.0
1  A  2021-04-21    NaN
2  B  2020-12-10    3.0
3  C  2017-01-01    NaN
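The steps above can be wrapped into one self-contained function, sketched below. The function name add_earliest_value and the intermediate names are made up for this illustration. Unlike the snippet above, it sorts the filtered pairs by Start itself, so it does not rely on df2 already being ordered. To show that, the two A rows of df2 are deliberately swapped here.

```python
import pandas as pd

def add_earliest_value(dates, ranges):
    """Left-join `dates` with the Value of the oldest matching range per (Id, Date)."""
    # All (date, range) pairs sharing the same Id
    pairs = dates.merge(ranges, on='Id')
    # Keep pairs where Start <= Date <= End (between is inclusive by default)
    inside = pairs[pairs['Date'].between(pairs['Start'], pairs['End'])]
    # Oldest Start first, then keep one row per (Id, Date)
    first = (inside.sort_values('Start')
                   .drop_duplicates(['Id', 'Date'], keep='first'))
    # Left merge keeps every row of `dates`, with NaN where nothing matched
    return dates.merge(first[['Id', 'Date', 'Value']], on=['Id', 'Date'], how='left')

df1 = pd.DataFrame({
    'Id': ['A', 'A', 'B', 'C'],
    'Date': pd.to_datetime(["2020-03-01", "2021-04-21", "2020-12-10", "2017-01-01"]),
})
# Same ranges as in the question, with the two "A" rows swapped on purpose
df2 = pd.DataFrame({
    'Id': ['A', 'A', 'B'],
    'Start': pd.to_datetime(["2020-02-21", "2020-01-01", "2019-12-10"]),
    'End': pd.to_datetime(["2021-02-21", "2021-01-01", "2021-12-10"]),
    'Value': [2, 1, 3],
})

df_out = add_earliest_value(df1, df2)
print(df_out)
```

One thing to keep in mind: the merge on Id alone builds every date/range pair per Id before filtering, so memory use grows with the number of ranges per Id.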