How to aggregate a DataFrame to keep the rows with the latest date and add a new column in Python Pandas?

Question

I have a Python Pandas DataFrame like the one below ("date_col" has the "datetime64" dtype):

ID  | date_col   | purchase
----|------------|-------
111 | 2019-01-05 | apple
111 | 2019-05-22 | onion
222 | 2020-11-04 | banana
333 | 2020-04-19 | orange

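I need to aggregate the table above as follows:

Add a column "col1" containing the number of purchases per customer ("ID")
If a customer ("ID") appears more than once, keep only the row with the latest date

As a result, I need something like this: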

ID  | date_col   | purchase | col1
----|------------|----------|-----
111 | 2019-05-22 | onion    | 2
222 | 2020-11-04 | banana   | 1
333 | 2020-04-19 | orange   | 1

Answer 1

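Assuming the DataFrame is already sorted by date_col, you can use groupby (note that the count column comes out named "size", not "col1"):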

g = df.groupby('ID', as_index=False)
g.last().merge(g.size())

    ID    date_col purchase  size
0  111  2019-05-22    onion     2
1  222  2020-11-04   banana     1
2  333  2020-04-19   orange     1
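
If the frame is not guaranteed to be sorted, the same idea can be sketched with an explicit sort and a rename of the count column. This is a minimal sketch that assumes the sample data from the question; the variable name `result` is only illustrative. One caveat: `last()` picks the last non-null value in each column, so with missing values the columns of a result row may come from different source rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [111, 111, 222, 333],
    "date_col": pd.to_datetime(
        ["2019-01-05", "2019-05-22", "2020-11-04", "2020-04-19"]
    ),
    "purchase": ["apple", "onion", "banana", "orange"],
})

# Sort by date first so that "last" means "latest date" within each ID
g = df.sort_values("date_col").groupby("ID", as_index=False)

# g.size() yields columns ID and size; rename size to the requested col1
counts = g.size().rename(columns={"size": "col1"})
result = g.last().merge(counts, on="ID")

print(result)
```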

Answer 2

Here is one way:

df['col1'] = df.groupby('ID')['ID'].transform('count')
df = df.sort_values('date_col').groupby('ID').tail(1)

Output:

    ID    date_col purchase  col1
1  111  2019-05-22    onion     2
3  333  2020-04-19   orange     1
2  222  2020-11-04   banana     1

Answer 3

You can try using groupby.transform to create a new count column, then select the rows with the latest date via groupby.idxmax:

df['date_col'] = pd.to_datetime(df['date_col'])
df = (df.assign(col1=df.groupby('ID')['purchase'].transform('count'))
      .loc[lambda df: df.groupby('ID')['date_col'].idxmax()])

print(df)

    ID   date_col purchase  col1
1  111 2019-05-22    onion     2
2  222 2020-11-04   banana     1
3  333 2020-04-19   orange     1

Answer 4

df['col1'] = df.groupby('ID')['ID'].transform('count')
df.sort_values('date_col').drop_duplicates('ID',keep='last')
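
A self-contained version of the two lines above might look like the sketch below. It assumes the sample data from the question; the trailing `sort_values("ID")` and `reset_index(drop=True)` are additions for readable output and are not part of the original answer, and `result` is an illustrative name.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [111, 111, 222, 333],
    "date_col": pd.to_datetime(
        ["2019-01-05", "2019-05-22", "2020-11-04", "2020-04-19"]
    ),
    "purchase": ["apple", "onion", "banana", "orange"],
})

# Count purchases per ID before any rows are dropped
df["col1"] = df.groupby("ID")["ID"].transform("count")

# Sort by date, then keep only the last (latest) row for each ID
result = (
    df.sort_values("date_col")
      .drop_duplicates("ID", keep="last")
      .sort_values("ID")          # added: restore ID order
      .reset_index(drop=True)     # added: clean 0..n-1 index
)

print(result)
```

The count has to be computed before `drop_duplicates`; afterwards every ID would have exactly one row left.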
