Maximum value from previous rows based on rolling period pandas
I have the following dataset:
import pandas as pd

data = pd.DataFrame({
    'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
    'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
                     '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
    'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
    'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31',
                     '2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
    'Due_Date': ['2020-07-03', '2020-07-07', '2020-07-03', '2020-07-20', '2020-07-29',
                 '2020-08-06', '2020-08-10', '2020-05-18', '2020-06-23', '2020-07-04'],
    'Delay': [2,-2,0,1,2,9,12,29,0,1],
    'Difference_Date': [0,3,1,14,11,5,3,0,38,8],
})
data
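I need to add another column, Max, that shows the maximum Delay of the previous rows. There is one more condition: a 30-day rolling period. That is, Max for the current row should be the maximum Delay among the previous rows that fall within the 30 days leading up to the current row's Invoice_Date.
The desired output is: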
ID Invoice_Date Payment_Term Payment_Date Due_Date Delay Difference_Date Max
27459 2020-06-26 7 2020-07-05 2020-07-03 2 0 0
27459 2020-06-29 8 2020-07-05 2020-07-07 -2 3 2
27459 2020-06-30 3 2020-07-03 2020-07-03 0 1 2
27459 2020-07-14 6 2020-07-21 2020-07-20 1 14 2
27459 2020-07-25 4 2020-07-31 2020-07-29 2 11 2
27459 2020-07-30 7 2020-08-15 2020-08-06 9 5 2
27459 2020-08-02 8 2020-08-22 2020-08-10 12 3 9
48002 2020-05-13 5 2020-06-16 2020-05-18 29 0 0
48002 2020-06-20 3 2020-06-23 2020-06-23 0 38 29
48002 2020-06-28 6 2020-07-05 2020-07-04 1 8 29
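One possible approach:
data['Invoice_Date'] = pd.to_datetime(data['Invoice_Date'])
groups = data.groupby('ID')
for group_name, df_group in groups:
    for idx, row in df_group.iterrows():
        # Calendar days in [Invoice_Date - 30 days, Invoice_Date); [:-1] drops the current day
        dt_range = pd.date_range(row['Invoice_Date'] - pd.to_timedelta(30, 'day'), row['Invoice_Date'])[:-1]
        # Max Delay among this ID's invoices dated inside that window (NaN if there are none)
        data.loc[idx, 'max'] = df_group[df_group.Invoice_Date.isin(dt_range)].Delay.max()
print(data)
Output: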
ID Invoice_Date Payment_Term Payment_Date Due_Date Delay Difference_Date max
0 27459 2020-06-26 7 2020-07-05 2020-07-03 2 0 NaN
1 27459 2020-06-29 8 2020-07-05 2020-07-07 -2 3 2.0
2 27459 2020-06-30 3 2020-07-03 2020-07-03 0 1 2.0
3 27459 2020-07-14 6 2020-07-21 2020-07-20 1 14 2.0
4 27459 2020-07-25 4 2020-07-31 2020-07-29 2 11 2.0
5 27459 2020-07-30 7 2020-08-15 2020-08-06 9 5 2.0
6 27459 2020-08-02 8 2020-08-22 2020-08-10 12 3 9.0
7 48002 2020-05-13 5 2020-06-16 2020-05-18 29 0 NaN
8 48002 2020-06-20 3 2020-06-23 2020-06-23 0 38 NaN
9 48002 2020-06-28 6 2020-07-05 2020-07-04 1 8 0.0
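You can fill the NaN values with data.fillna(0). Note that for ID "48002" the 2020-06-20 row is NaN because the previous invoice (2020-05-13) lies outside the 30-day window; the first row of each ID is NaN because it has no previous rows at all.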
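Note that data.fillna(0) returns a new frame and fills every column. A minimal sketch of filling only the new column, using a small hypothetical frame (`result`) that mimics the ID "48002" rows of the output above:

```python
import pandas as pd

# Hypothetical miniature of the output above: 'max' is NaN wherever no
# earlier invoice of the same ID falls inside the 30-day window.
result = pd.DataFrame({
    "ID": ["48002", "48002", "48002"],
    "Delay": [29, 0, 1],
    "max": [float("nan"), float("nan"), 0.0],
})

# fillna returns a new object, so the result is assigned back. Restricting
# it to the 'max' column leaves NaN values in any other column untouched.
result["max"] = result["max"].fillna(0).astype(int)
print(result)
```

Casting back to int is optional; it only removes the float display that NaN forced on the column.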
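You can use the rolling method to operate only on a window of past elements. However, the dates must be monotonic (ascending or descending), which means they should be sorted first.
You can try the following: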
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'])
df.set_index('Invoice_Date', inplace=True)
df.sort_index(inplace=True)
df['max'] = df.groupby('ID')['Delay'].transform(lambda x: x.rolling('30D', closed='left').max())
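Edit: as @Cainã suggested, a groupby was added so that the process is done separately for each unique ID.
The closed parameter is needed to specify that the current day should not be included.
The new dataframe looks like this (sorted here by Invoice_Date only):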
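To illustrate what closed changes, here is a minimal sketch on the first three Delay values of ID 27459 (the series name `s` and the comparison frame are only for illustration):

```python
import pandas as pd

# Three Delay values from ID 27459, indexed by invoice date.
s = pd.Series(
    [2, -2, 0],
    index=pd.to_datetime(["2020-06-26", "2020-06-29", "2020-06-30"]),
)

# Time-based windows default to closed='right': the current row is included.
with_current = s.rolling("30D").max()

# closed='left' covers [t - 30 days, t), so only earlier rows are considered.
previous_only = s.rolling("30D", closed="left").max()

print(pd.DataFrame({"right": with_current, "left": previous_only}))
```

With closed='left' the first row has an empty window, which is why it comes out as NaN.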
ID Delay Max
Invoice_Date
2020-05-13 48002 29 NaN
2020-06-20 48002 0 NaN
2020-06-26 27459 2 NaN
2020-06-28 48002 1 0.0
2020-06-29 27459 -2 2.0
2020-06-30 27459 0 2.0
2020-07-14 27459 1 2.0
2020-07-25 27459 2 2.0
2020-07-30 27459 9 2.0
2020-08-02 27459 12 9.0
If we also sort by ID (by running df.reset_index().sort_values(['ID','Invoice_Date'])), we get:
ID Delay Max
Invoice_Date
2020-06-26 27459 2 NaN
2020-06-29 27459 -2 2.0
2020-06-30 27459 0 2.0
2020-07-14 27459 1 2.0
2020-07-25 27459 2 2.0
2020-07-30 27459 9 2.0
2020-08-02 27459 12 9.0
2020-05-13 48002 29 NaN
2020-06-20 48002 0 NaN
2020-06-28 48002 1 0.0
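df.rolling can do the job and is probably the most efficient option.
# pd.to_datetime is used because astype("datetime64") without a unit raises on newer pandas versions
df["Invoice_Date"] = pd.to_datetime(df["Invoice_Date"])
# .values assigns by position, so df must already be sorted by ID and then Invoice_Date
df["Max"] = df.groupby("ID").rolling("30d", on="Invoice_Date", closed="left").Delay.max().values
Result: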
ID Invoice_Date Payment_Term Payment_Date Due_Date Delay Difference_Date Max
0 27459 2020-06-26 7 2020-07-05 2020-07-03 2 0 NaN
1 27459 2020-06-29 8 2020-07-05 2020-07-07 -2 3 2.0
2 27459 2020-06-30 3 2020-07-03 2020-07-03 0 1 2.0
3 27459 2020-07-14 6 2020-07-21 2020-07-20 1 14 2.0
4 27459 2020-07-25 4 2020-07-31 2020-07-29 2 11 2.0
5 27459 2020-07-30 7 2020-08-15 2020-08-06 9 5 2.0
6 27459 2020-08-02 8 2020-08-22 2020-08-10 12 3 9.0
7 48002 2020-05-13 5 2020-06-16 2020-05-18 29 0 NaN
8 48002 2020-06-20 3 2020-06-23 2020-06-23 0 38 NaN
9 48002 2020-06-28 6 2020-07-05 2020-07-04 1 8 0.0
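The positional assignment above lines up because the sample rows are already ordered by ID and then Invoice_Date. A minimal sketch, on a small shuffled subset of the data, of one way to make the write-back independent of row order (the names `ordered` and `rolled` are illustrative, not from the answer):

```python
import pandas as pd

# Shuffled subset of the question's data, so row order is NOT ID/date sorted.
data = pd.DataFrame({
    "ID": ["48002", "27459", "27459", "48002", "27459"],
    "Invoice_Date": ["2020-06-20", "2020-06-26", "2020-06-29",
                     "2020-06-28", "2020-06-30"],
    "Delay": [0, 2, -2, 1, 0],
})
data["Invoice_Date"] = pd.to_datetime(data["Invoice_Date"])

# Sort a copy so dates are ascending within each ID; the original row
# labels are kept in the index of `ordered`.
ordered = data.sort_values(["ID", "Invoice_Date"])

# Same rolling call as above; its output order matches `ordered` row for row.
rolled = (
    ordered.groupby("ID")
    .rolling("30D", on="Invoice_Date", closed="left")["Delay"]
    .max()
)
ordered["Max"] = rolled.to_numpy()

# Column assignment aligns on the index, restoring the original row order.
data["Max"] = ordered["Max"]
print(data)
```

Sorting a copy and aligning back on the index keeps the caller's row order untouched while still giving rolling the per-ID ascending dates it needs.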