Correcting Pandas Cumulative Sum on a Timedelta Column
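I currently have a line of code that tries to create a column holding the cumulative sum of the timedelta values between dates. However, it does not compute the cumulative sum correctly in every row, and I also get a warning that my line of Python will stop working in a future version.
The original dataset looks like this: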
ID CREATION_DATE TIMEDIFF EDITNUMB
8211 11/26/2019 13:00 1
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1
Here is my line of Python:
df['RECUR'] = df.groupby(['ID']).TIMEDIFF.apply(lambda x: x.shift().fillna(1).cumsum())
It produces a new column 'RECUR', which does not accumulate correctly from the data in the 'TIMEDIFF' column:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
8211 11/26/2019 13:00 1 0 days 00:00:01.000000000
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1 0 days 00:00:02.000000000
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1 37 days 20:11:11.000000000
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1 69 days 01:52:08.000000000
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1 122 days 01:59:57.000000000
It also raises this warning:
FutureWarning: Passing integers to fillna is deprecated, will raise a TypeError in a future version. To retain the old behavior, pass pd.Timedelta(seconds=n) instead.
Any help is appreciated. Counting from 11/26/2019, the total should come to 153 days and be displayed correctly in the 'RECUR' column.
IIUC, you can do this:
# transform('first') would also work
df['RECUR'] = df['CREATION_DATE'] - df.groupby('ID').CREATION_DATE.transform('min')
Output:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
0 8211 2019-11-26 13:00:00 NaT 1 0 days 00:00:00
1 8211 2020-01-03 09:11:00 37 days 20:11:00 1 37 days 20:11:00
2 8211 2020-02-03 14:52:00 31 days 05:41:00 1 69 days 01:52:00
3 8211 2020-03-27 15:00:00 53 days 00:08:00 1 122 days 02:00:00
4 8211 2020-04-29 12:07:00 32 days 21:07:00 1 154 days 23:07:00
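For anyone who wants to run the approach above end to end, here is a minimal sketch. It assumes CREATION_DATE starts out as month-first strings, as displayed in the question, so it parses them with an explicit format before subtracting; the DataFrame construction is only an illustration built from the question's sample rows.

```python
import pandas as pd

# Sample rows copied from the question; CREATION_DATE begins as strings.
df = pd.DataFrame({
    "ID": [8211] * 5,
    "CREATION_DATE": [
        "11/26/2019 13:00", "1/3/2020 9:11", "2/3/2020 14:52",
        "3/27/2020 15:00", "4/29/2020 12:07",
    ],
})

# The subtraction needs datetime64 values, so parse the strings first.
df["CREATION_DATE"] = pd.to_datetime(df["CREATION_DATE"], format="%m/%d/%Y %H:%M")

# Elapsed time since the earliest CREATION_DATE within each ID.
first_date = df.groupby("ID")["CREATION_DATE"].transform("min")
df["RECUR"] = df["CREATION_DATE"] - first_date

print(df["RECUR"].iloc[-1])  # 154 days 23:07:00
```

Because this works from the dates themselves, it does not depend on TIMEDIFF at all, which is why the results are rounded to the minute while the cumulative sum of TIMEDIFF keeps the seconds.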
You can fillna the timedelta column with 0 seconds and then take the cumsum:
df['RECUR'] = df.groupby('ID').TIMEDIFF.apply(
lambda x: x.fillna(pd.Timedelta(seconds=0)).cumsum())
df['RECUR']
# 0 0 days 00:00:00
# 1 37 days 20:11:09
# 2 69 days 01:52:06
# 3 122 days 01:59:55
# 4 154 days 23:07:18
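As for why the original line went wrong: `shift()` moves every TIMEDIFF value down one row, so each group begins with two missing values, and `fillna(1)` fills each of them with one second. That is where the 1 s and 2 s in the first two rows come from, and why every later total is 2 s too large and one row late. Below is a self-contained sketch of the fix using `transform` instead of `apply`. This is a swapped-in variant, not the code from the answer above; the assumption behind it is that `transform` returns a result aligned to the original index, whereas what `apply` returns for a function like this can differ between pandas versions.

```python
import pandas as pd

# Sample rows copied from the question; the first TIMEDIFF is missing (NaT).
df = pd.DataFrame({
    "ID": [8211] * 5,
    "TIMEDIFF": pd.to_timedelta([
        pd.NaT,
        "37 days 20:11:09",
        "31 days 05:40:57",
        "53 days 00:07:49",
        "32 days 21:07:23",
    ]),
})

# Treat the missing first gap as zero elapsed time, then accumulate per ID.
# A Timedelta fill value avoids the integer-fillna FutureWarning.
df["RECUR"] = df.groupby("ID")["TIMEDIFF"].transform(
    lambda s: s.fillna(pd.Timedelta(0)).cumsum()
)

print(df["RECUR"].iloc[-1])  # 154 days 23:07:18
```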