Python Pandas - group by keys 与重复值的区别
Python Pandas - Difference between groupby keys with repeated valyes
我有一些关于向客户销售日期的数据。
数据如下所示:
Cod client
Items
Date
0
100
1
2022/01/01
1
100
7
2022/01/01
2
100
2
2022/02/01
3
101
5
2022/01/01
4
101
8
2022/02/01
5
101
10
2022/02/01
6
101
2
2022/04/01
7
101
2
2022/04/01
8
102
4
2022/02/01
9
102
10
2022/03/01
我要完成的是计算每个客户日期之间的差异:首先按“Cod client”分组,然后按“Date”分组(因为重复)
预期结果如下:
Cod client
Items
Date
Date diff
Explain
0
100
1
2022/01/01
NaT
First date for client 100
1
100
7
2022/01/01
NaT
...repeat above
2
100
2
2022/02/01
31
Diff from first date 2022/01/01
3
101
5
2022/01/01
NaT
Fist date for client 101
4
101
8
2022/02/01
31
Diff from first date 2022/01/01
5
101
10
2022/02/01
31
...repeat above
6
101
2
2022/04/01
59
Diff from previous date 2022/02/01
7
101
2
2022/04/01
59
...repeat above
8
102
4
2022/02/01
NaT
First date for client 102
9
102
10
2022/03/01
28
Diff from first date 2022/02/01
我已经尝试过 df["Date diff"] = df.groupby("Cod client")["Date"].diff()
但它考虑了重复日期和 return 零
感谢您的帮助!
IIUC 您可以组合多个 groupby
操作:
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
输出:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
另一种方法,transform
:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert Date column to correct date.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())
我有一些关于向客户销售日期的数据。 数据如下所示:
Cod client | Items | Date | |
---|---|---|---|
0 | 100 | 1 | 2022/01/01 |
1 | 100 | 7 | 2022/01/01 |
2 | 100 | 2 | 2022/02/01 |
3 | 101 | 5 | 2022/01/01 |
4 | 101 | 8 | 2022/02/01 |
5 | 101 | 10 | 2022/02/01 |
6 | 101 | 2 | 2022/04/01 |
7 | 101 | 2 | 2022/04/01 |
8 | 102 | 4 | 2022/02/01 |
9 | 102 | 10 | 2022/03/01 |
我要完成的是计算每个客户日期之间的差异:首先按“Cod client”分组,然后按“Date”分组(因为重复)
预期结果如下:
Cod client | Items | Date | Date diff | Explain | |
---|---|---|---|---|---|
0 | 100 | 1 | 2022/01/01 | NaT | First date for client 100 |
1 | 100 | 7 | 2022/01/01 | NaT | ...repeat above |
2 | 100 | 2 | 2022/02/01 | 31 | Diff from first date 2022/01/01 |
3 | 101 | 5 | 2022/01/01 | NaT | Fist date for client 101 |
4 | 101 | 8 | 2022/02/01 | 31 | Diff from first date 2022/01/01 |
5 | 101 | 10 | 2022/02/01 | 31 | ...repeat above |
6 | 101 | 2 | 2022/04/01 | 59 | Diff from previous date 2022/02/01 |
7 | 101 | 2 | 2022/04/01 | 59 | ...repeat above |
8 | 102 | 4 | 2022/02/01 | NaT | First date for client 102 |
9 | 102 | 10 | 2022/03/01 | 28 | Diff from first date 2022/02/01 |
我已经尝试过 df["Date diff"] = df.groupby("Cod client")["Date"].diff()
但它考虑了重复日期和 return 零
感谢您的帮助!
IIUC 您可以组合多个 groupby
操作:
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
输出:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
另一种方法,transform
:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert Date column to correct date.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())