Python Pandas - Difference between groupby keys with repeated values

I have some data about the dates of sales to clients. The data looks like this:

   Cod client  Items        Date
0         100      1  2022/01/01
1         100      7  2022/01/01
2         100      2  2022/02/01
3         101      5  2022/01/01
4         101      8  2022/02/01
5         101     10  2022/02/01
6         101      2  2022/04/01
7         101      2  2022/04/01
8         102      4  2022/02/01
9         102     10  2022/03/01

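What I want to accomplish is to calculate the difference between consecutive dates for each client: grouping first by "Cod client" and then by "Date" (because of the repeated dates).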

The expected result looks like this:

   Cod client  Items        Date  Date diff  Explain
0         100      1  2022/01/01        NaT  First date for client 100
1         100      7  2022/01/01        NaT  ...repeat above
2         100      2  2022/02/01         31  Diff from first date 2022/01/01
3         101      5  2022/01/01        NaT  First date for client 101
4         101      8  2022/02/01         31  Diff from first date 2022/01/01
5         101     10  2022/02/01         31  ...repeat above
6         101      2  2022/04/01         59  Diff from previous date 2022/02/01
7         101      2  2022/04/01         59  ...repeat above
8         102      4  2022/02/01        NaT  First date for client 102
9         102     10  2022/03/01         28  Diff from first date 2022/02/01

I have already tried df["Date diff"] = df.groupby("Cod client")["Date"].diff(), but it treats the repeated dates as separate steps and returns zero for them.

Thanks for your help!

IIUC, you can combine several groupby operations:

import pandas as pd

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])

# set up group
g = df.groupby('Cod client')

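# identify duplicated dates per group (repeated "Cod client"/"Date" pairs);
# df.duplicated keeps the original row index, so the mask aligns with the diff
m = df.duplicated(['Cod client', 'Date'])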

# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()

Output:

   Cod client  Items       Date Date diff
0         100      1 2022-01-01       NaT
1         100      7 2022-01-01       NaT
2         100      2 2022-02-01   31 days
3         101      5 2022-01-01       NaT
4         101      8 2022-02-01   31 days
5         101     10 2022-02-01   31 days
6         101      2 2022-04-01   59 days
7         101      2 2022-04-01   59 days
8         102      4 2022-02-01       NaT
9         102     10 2022-03-01   28 days
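
To try this on the sample data from the question, here is a minimal self-contained sketch. It assumes the dates are in year/month/day order (which is what the expected differences of 31, 59 and 28 days imply) and that the rows of each client are already in date order. The names naive, repeated and the extra "Days" column are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Cod client": [100, 100, 100, 101, 101, 101, 101, 101, 102, 102],
    "Items": [1, 7, 2, 5, 8, 10, 2, 2, 4, 10],
    "Date": ["2022/01/01", "2022/01/01", "2022/02/01", "2022/01/01",
             "2022/02/01", "2022/02/01", "2022/04/01", "2022/04/01",
             "2022/02/01", "2022/03/01"],
})
# Assumption: year/month/day order
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Plain per-client diff: a repeated date yields "0 days"
naive = df.groupby("Cod client")["Date"].diff()

# Flag repeated (client, date) pairs, blank their diffs,
# then forward-fill within each client
repeated = df.duplicated(["Cod client", "Date"])
df["Date diff"] = naive.mask(repeated).groupby(df["Cod client"]).ffill()

# Optional: whole days as nullable integers, as in the question's layout
df["Days"] = df["Date diff"].dt.days.astype("Int64")

print(df)
```

The forward-fill step is what copies the 31 or 59 days onto the repeated rows; without it those rows would stay NaT.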

Another approach, using transform:

import pandas as pd

# data saved as .csv
df = pd.read_csv("Data.csv", header=0)
# convert the Date column to datetime (the sample data is year/month/day)
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
# new column: per-client diff, blanked on repeated dates and forward-filled
df["Date diff"] = (
    df.sort_values("Date")
    .groupby("Cod client")["Date"]
    .transform(lambda x: x.diff().mask(x.duplicated()).ffill())
)
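
A third option, not taken from either answer above, is to compute the difference once per distinct (client, date) pair and merge it back onto the original rows. This is only a sketch under the same year/month/day assumption; the shuffled sample rows and the names unique_dates and result are made up for illustration. It may be handy when the rows are not sorted, since it does not rely on forward-filling in row order.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample in the same shape as the question's data
df = pd.DataFrame({
    "Cod client": [101, 100, 101, 100, 101, 101, 100],
    "Date": ["2022/04/01", "2022/01/01", "2022/02/01", "2022/02/01",
             "2022/01/01", "2022/04/01", "2022/01/01"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# One row per (client, date) pair, ordered so that diff() compares
# consecutive distinct dates of the same client
unique_dates = (
    df[["Cod client", "Date"]]
    .drop_duplicates()
    .sort_values(["Cod client", "Date"])
)
unique_dates["Date diff"] = unique_dates.groupby("Cod client")["Date"].diff()

# Copy the per-date diff back onto every original row (left merge keeps row order)
result = df.merge(unique_dates, on=["Cod client", "Date"], how="left")

print(result)
```

Every row that shares a client and date receives the same difference, and the first date of each client stays NaT.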