How to join in pandas on a.date <= b.date and then take only the row where a.date is max?

Question

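I'm trying to join two dataframes on ID and date. However, the date criterion is a.date <= b.date, and when several a.date values qualify, I want the maximum one (but still <= b.date).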

Dataframe A (cumulative sales table)
ID| date       | cumulative_sales
1 | 2020-01-01 | 10
1 | 2020-01-03 | 15
1 | 2021-01-02 | 20

Dataframe B
ID| date       | cumulative_sales (up to this date, how much was purchased for a given ID?)
1 | 2020-05-01 | 15

In SQL, I would join on a.date <= b.date, then apply dense_rank() and take the max within that partition for each ID. I'm not sure how to approach this with pandas. Any suggestions?

Answer 1

We can use merge, then filter and keep the last row per ID:

out = df1.merge(df2, on = 'ID', suffixes = ('','_x')).\
            query('date<=date_x').sort_values('date').drop_duplicates('ID',keep='last')[df1.columns]
Out[272]: 
   ID          date  cumulative_sales
1   1   2020-01-03                 15
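
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same chain using the question's data. It assumes df1 is table A and df2 is table B, and that B carries only ID and date (the question's B also shows the expected cumulative_sales, which is left out here).

```python
import pandas as pd

# Data from the question; df1 plays the role of A, df2 the role of B.
# Assumption: B only needs ID and date for the join.
df1 = pd.DataFrame({
    'ID': [1, 1, 1],
    'date': pd.to_datetime(['2020-01-01', '2020-01-03', '2021-01-02']),
    'cumulative_sales': [10, 15, 20],
})
df2 = pd.DataFrame({
    'ID': [1],
    'date': pd.to_datetime(['2020-05-01']),
})

out = (
    df1.merge(df2, on='ID', suffixes=('', '_x'))  # pair each A row with the B row for its ID
       .query('date <= date_x')                   # keep only a.date <= b.date
       .sort_values('date')
       .drop_duplicates('ID', keep='last')        # latest remaining a.date per ID
       [df1.columns]                              # return only A's columns
)
print(out)
```

If B can hold several dates per ID, deduplicating on ['ID', 'date_x'] instead of 'ID' would likely be needed to keep one result per B row.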

Answer 2

Here's one way to do what your question asks:

dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query('date <= date_b').drop(
    columns='date_b').groupby(['ID']).last().reset_index()

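Explanation:

Sort dfA by ID and date.
Use join to join with dfB on ID, bringing in dfB's columns with the suffix _b.
Use query to keep only the rows where dfA.date <= dfB.date.
Use groupby on ID and then last to select the row with the highest remaining dfA.date (that is, the highest dfA.date that is <= dfB.date for each ID).
Use reset_index to turn ID from an index level back into a column label.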

Full test code:

import pandas as pd
dfA = pd.DataFrame({'ID':[1,1,1,2,2,2], 'date':['2020-01-01','2020-01-03','2020-01-02','2020-01-01','2020-01-03','2020-01-02'], 'cumulative_sales':[10,15,20,30,40,50]})
dfB = pd.DataFrame({'ID':[1,2], 'date':['2020-05-01','2020-01-01'], 'cumulative_sales':[15,30]})
print(dfA)
print(dfB)

dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query(
    'date <= date_b').drop(columns='date_b').groupby(['ID']).last().reset_index()
print(dfA)

Input:

dfA:

   ID        date  cumulative_sales
0   1  2020-01-01                10
1   1  2020-01-03                15
2   1  2020-01-02                20
3   2  2020-01-01                30
4   2  2020-01-03                40
5   2  2020-01-02                50

dfB:

   ID        date  cumulative_sales
0   1  2020-05-01                15
1   2  2020-01-01                30

Output:

   ID        date  cumulative_sales  cumulative_sales_b
0   1  2020-01-03                15                  15
1   2  2020-01-01                30                  30

Note: I have left cumulative_sales_b in place in case it's wanted. If it isn't needed, it can be dropped by replacing drop(columns='date_b') with drop(columns=['date_b', 'cumulative_sales_b']).

Update:

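Just for fun, if your version of Python has the walrus operator := (formally an "assignment expression", available since Python 3.8), you can do this instead of using query: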

dfA = (dfA := dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b'))[dfA.date <= dfA.date_b].drop(
    columns='date_b').groupby(['ID']).last().reset_index()
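
As a possible alternative that needs neither query nor the walrus operator, a callable passed to .loc receives the joined frame, so the filter can stay inside the method chain. This is only a sketch using the test data from above; the variable name result is chosen here for illustration.

```python
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'date': ['2020-01-01', '2020-01-03', '2020-01-02',
                             '2020-01-01', '2020-01-03', '2020-01-02'],
                    'cumulative_sales': [10, 15, 20, 30, 40, 50]})
dfB = pd.DataFrame({'ID': [1, 2],
                    'date': ['2020-05-01', '2020-01-01'],
                    'cumulative_sales': [15, 30]})

result = (
    dfA.sort_values(['ID', 'date'])
       .join(dfB.set_index('ID'), on='ID', rsuffix='_b')
       .loc[lambda d: d['date'] <= d['date_b']]  # the callable sees the joined frame
       .drop(columns='date_b')
       .groupby('ID').last().reset_index()
)
print(result)
```

The dates here are ISO-formatted strings, so comparing them as text gives the same order as comparing real dates; converting with pd.to_datetime first would be the safer choice for other formats.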

Answer 3

It looks like you just want a merge_asof:

dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'), 
                    on='date', by='ID')
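
One thing to be aware of: in the merge_asof result the date column holds dfB's date, not the matched dfA date. Below is a minimal sketch, using the question's data, that copies A's date into a separate column first so it survives the merge. The column name a_date is an assumption made for illustration, and dfB is assumed to carry only ID and date.

```python
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1],
                    'date': pd.to_datetime(['2020-01-01', '2020-01-03', '2021-01-02']),
                    'cumulative_sales': [10, 15, 20]})
dfB = pd.DataFrame({'ID': [1],
                    'date': pd.to_datetime(['2020-05-01'])})

# merge_asof keeps the left frame's key in 'date', so keep a copy of A's
# date under another name (a_date is a hypothetical name).
right = dfA.assign(a_date=dfA['date']).sort_values('date')

# direction='backward' (the default) matches the last right row whose
# date is <= the left row's date, within the same ID.
out = pd.merge_asof(dfB.sort_values('date'), right,
                    on='date', by='ID', direction='backward')
print(out)
```

Both frames must be sorted by the on key, which is why sort_values is applied to each side before the merge.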
