Merging with pandas: left_on with a date and right_on with the earliest time range containing this date

Let's take these two sample dataframes:

df1 = pd.DataFrame({'Id':['A','A','B','C'], 'Date':["2020-03-01","2021-04-21","2020-12-10","2017-01-01"]})

  Id        Date
0  A  2020-03-01
1  A  2021-04-21
2  B  2020-12-10
3  C  2017-01-01

df2=pd.DataFrame({'Id':['A','A','B'], 'Start':["2020-01-01","2020-02-21","2019-12-10"],
                 'End':["2021-01-01","2021-02-21","2021-12-10"], "Value":[1,2,3]})

  Id       Start         End  Value
0  A  2020-01-01  2021-01-01      1
1  A  2020-02-21  2021-02-21      2
2  B  2019-12-10  2021-12-10      3

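I would like to add a Value column to df1. The value is to be found in df2: it belongs to the row with the same Id whose Start and End (in df2) bracket the Date (in df1). If several rows qualify, I want the value from the one with the earliest Start date.

I currently do this with a for loop, but it is very slow on my real, much larger dataframes. My intuition is that this can be done with a left join, but I can't figure out how. Do you have any ideas?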

Expected output:

  Id        Date   Value
0  A  2020-03-01     1.0
1  A  2021-04-21     NaN
2  B  2020-12-10     3.0
3  C  2017-01-01     NaN

I recently answered a similar question.

Let's make sure that the dates are datetime and sort df2 by Start:

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Start'] = pd.to_datetime(df2['Start'])
df2['End'] = pd.to_datetime(df2['End'])
df2.sort_values(by='Start', inplace=True)

Make the index of df2 an IntervalIndex:

df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'],closed='both')
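As a quick illustration of what such an index provides, here is a minimal sketch on two overlapping ranges shaped like the rows for Id A. The variable names (`idx`, `inside_both`, `on_boundary`, `outside`) are only for this example. It assumes a pandas version that has `IntervalIndex.contains`.

```python
import pandas as pd

# Two overlapping ranges, closed on both ends like df2's index
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-01-01", "2020-02-21"]),
    pd.to_datetime(["2021-01-01", "2021-02-21"]),
    closed="both",
)

# contains() returns one boolean per interval
inside_both = idx.contains(pd.Timestamp("2020-03-01"))
on_boundary = idx.contains(pd.Timestamp("2021-02-21"))  # End of the 2nd range
outside = idx.contains(pd.Timestamp("2019-06-01"))

print(idx.is_overlapping)  # the ranges overlap, so a date can match several
print(inside_both, on_boundary, outside)
```

Because the ranges overlap, a single date can fall in several of them, which is why df2 is sorted by Start before picking a match.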

Write a custom function and apply it to each row:

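def get_value(s):
    # True where the interval contains the date and the Id matches
    mask = df2.index.contains(s['Date']) & (df2['Id'] == s['Id']).to_numpy()
    values = df2['Value'].to_numpy()[mask]
    # df2 is sorted by Start, so the first match has the earliest Start;
    # returning None yields NaN when no range matches
    return values[0] if len(values) else None

df1['Value'] = df1.apply(get_value, axis=1)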

Output:

  Id       Date  Value
0  A 2020-03-01    1.0
1  A 2021-04-21    NaN
2  B 2020-12-10    3.0
3  C 2017-01-01    NaN

Use .merge() + .between() + .drop_duplicates():

# Sort if not already in `Id`, `Start` order
#df2 = df2.sort_values(by=['Id', 'Start'])

df3 = df1.merge(df2, on='Id')
df3_filtered = df3.loc[df3['Date'].between(df3['Start'], df3['End'])]

df4 = df3_filtered.drop_duplicates(['Id', 'Date'], keep='first')

df_out = df1.merge(df4[['Id', 'Date', 'Value']], how='left')

Result:

print(df_out)

  Id        Date  Value
0  A  2020-03-01    1.0
1  A  2021-04-21    NaN
2  B  2020-12-10    3.0
3  C  2017-01-01    NaN
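The merge-based steps above can also be wrapped into one function. This is only a sketch: the helper name `add_earliest_value` is hypothetical, the column names are those of the sample data, and it sorts the candidate pairs by Start explicitly instead of relying on df2 already being in order.

```python
import pandas as pd

def add_earliest_value(df1, df2):
    """Hypothetical helper: add to df1 the Value of the matching df2 range
    (same Id, Start <= Date <= End) that has the earliest Start."""
    left = df1.assign(Date=pd.to_datetime(df1["Date"]))
    right = df2.assign(Start=pd.to_datetime(df2["Start"]),
                       End=pd.to_datetime(df2["End"]))

    # All (df1 row, df2 row) pairs sharing an Id, restricted to matching ranges
    pairs = left.merge(right, on="Id")
    pairs = pairs[pairs["Date"].between(pairs["Start"], pairs["End"])]

    # Earliest Start first, then keep one row per (Id, Date)
    first = (pairs.sort_values("Start")
                  .drop_duplicates(["Id", "Date"], keep="first"))

    # Left join so that df1 rows without any match get NaN
    return left.merge(first[["Id", "Date", "Value"]],
                      on=["Id", "Date"], how="left")

df1 = pd.DataFrame({"Id": ["A", "A", "B", "C"],
                    "Date": ["2020-03-01", "2021-04-21", "2020-12-10", "2017-01-01"]})
df2 = pd.DataFrame({"Id": ["A", "A", "B"],
                    "Start": ["2020-01-01", "2020-02-21", "2019-12-10"],
                    "End": ["2021-01-01", "2021-02-21", "2021-12-10"],
                    "Value": [1, 2, 3]})

out = add_earliest_value(df1, df2)
print(out)
```

Note that the intermediate merge holds every pair of rows sharing an Id, so memory use grows with the number of df2 ranges per Id.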