How to calculate the accumulated time of a groupby function using pandas?

Hi everyone! I have the following dataset (https://pastebin.com/697NsZXk):

dfA
Out[83]: 
                    time     Var1    Y1  Class1  flagA
2070 2020-09-15 10:30:00  66.3260  59.6  A-8444      1
2071 2020-09-15 10:31:00  66.2881  59.6  A-8444      1
2072 2020-09-15 10:32:00  66.2570  59.6  A-8444      1
2073 2020-09-15 10:33:00  66.2364  59.6  A-8444      1
2074 2020-09-15 10:34:00  66.2511  59.6  A-8444      1
2075 2020-09-15 10:35:00  66.2478  59.6  A-8444      1
2076 2020-09-15 10:36:00  66.2571  59.6  A-8444      1
2077 2020-09-15 10:37:00  66.2645  59.6  A-8444      1
2078 2020-09-15 10:38:00  66.2233  59.6  A-8444      1
2079 2020-09-15 10:39:00  66.2132  59.6  A-8444      1
                 ...      ...   ...     ...    ...
3501 2020-09-16 10:21:00  58.8167  59.3  A-8448      1
3502 2020-09-16 10:22:00  59.1132  59.3  A-8448      1
3503 2020-09-16 10:23:00  59.4533  59.3  A-8448      1
3504 2020-09-16 10:24:00  59.7931  59.3  A-8448      1
3505 2020-09-16 10:25:00  60.1398  59.3  A-8448      1
3506 2020-09-16 10:26:00  60.5043  59.3  A-8448      1
3507 2020-09-16 10:27:00  60.8606  59.3  A-8448      1
3508 2020-09-16 10:28:00  61.2513  59.3  A-8448      1
3509 2020-09-16 10:29:00  61.6430  59.3  A-8448      1
3510 2020-09-16 10:30:00  62.0610  59.3  A-8448      1

[1441 rows x 5 columns]

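I want to calculate the mean, minimum and maximum of Var1 and Y1, grouped by ['Class1', 'flagA']. I was able to do that with the code below, but I would also like to calculate the accumulated time covered by each group. For example, the result I currently get is: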

                     Var1                        Y1              
                 amin     amax    average  amin  amax average
Class1 flagA                                                 
A-8444 0      26.6498  49.8490  34.371305  59.6  59.6    59.6
       1      50.0507  67.0296  63.722390  59.6  59.6    59.6
A-8445 0      27.0750  49.8547  36.590446  59.7  59.7    59.7
       1      50.1771  67.0874  63.562250  59.7  59.7    59.7
A-8446 0      26.2272  49.4617  33.005098  59.4  59.4    59.4
       1      50.2412  67.1156  63.853893  59.4  59.4    59.4
A-8448 0      25.6820  49.6583  33.084543  59.3  59.3    59.3
       1      50.0283  62.0610  56.053144  59.3  59.3    59.3

But I also need another column showing how much time each group represents. Any ideas? It should look something like this:

                 Var1                        Y1              
                 amin     amax    average  amin  amax average  **accumulated time**
Class1 flagA                                                 
A-8444 0      26.6498  49.8490  34.371305  59.6  59.6    59.6    **hh:mm:ss**
       1      50.0507  67.0296  63.722390  59.6  59.6    59.6    **hh:mm:ss**
A-8445 0      27.0750  49.8547  36.590446  59.7  59.7    59.7    **hh:mm:ss**
       1      50.1771  67.0874  63.562250  59.7  59.7    59.7    **hh:mm:ss**
A-8446 0      26.2272  49.4617  33.005098  59.4  59.4    59.4    **hh:mm:ss**
       1      50.2412  67.1156  63.853893  59.4  59.4    59.4    **hh:mm:ss**
A-8448 0      25.6820  49.6583  33.084543  59.3  59.3    59.3    **hh:mm:ss**
       1      50.0283  62.0610  56.053144  59.3  59.3    59.3    **hh:mm:ss**

Current code:

#Creating flagA
conditions = [
(dfA['Var1'] < 50),
(dfA['Var1'] >= 50)
]
values = [0, 1]
dfA.loc[:,'flagA'] = np.select(conditions, values)

#groupby to calculate min, max and average. Need to add something to calculate accumulated time.
teste = dfA.groupby(['Class1','flagA']).agg([np.min, np.max, np.average])

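Group the data and find the minimum and maximum time of each group, then concatenate the two results into a single data frame. After that, take the difference between the maximum and minimum times and convert it to a time format.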

import pandas as pd
import numpy as np

dfA = pd.read_csv('./Data/697NsZXk.csv', sep=',')
# Parse the timestamps first so that min, max and subtraction operate on real datetimes.
dfA['time'] = pd.to_datetime(dfA['time'])
# Aggregate only the numeric columns; np.average is not meaningful for the time column.
teste = dfA.groupby(['Class1','flagA'])[['Var1','Y1']].agg([np.min, np.max, np.average])
# First and last timestamp of each group.
ts_min = dfA.groupby(['Class1','flagA'])['time'].min()
ts_max = dfA.groupby(['Class1','flagA'])['time'].max()
ts = pd.concat([ts_min,ts_max], axis=1)
ts.columns = ['ts_min', 'ts_max']
ts['ts_delta'] = ts['ts_max'] - ts['ts_min']
final = pd.concat([teste, ts[['ts_delta']]], axis=1)
# Render the Timedelta as "hours:minutes" text (no zero padding).
final['ts_delta'] = final['ts_delta'].apply(lambda x: str(int(x.total_seconds() // 3600))+':'+ str(int(x.total_seconds() % 3600 // 60)))

final
              (Var1, amin)  (Var1, amax)  (Var1, average)  (Y1, amin)  (Y1, amax)  (Y1, average)  ts_delta
Class1 flagA
A-8444 0           26.6498       49.8490        34.371305        59.6        59.6           59.6      8:23
       1           50.0507       67.0296        63.722390        59.6        59.6           59.6      8:10
A-8445 0           27.0750       49.8547        36.590446        59.7        59.7           59.7      6:38
       1           50.1771       67.0874        63.562250        59.7        59.7           59.7      4:44
A-8446 0           26.2272       49.4617        33.005098        59.4        59.4           59.4      1:28
       1           50.2412       67.1156        63.853893        59.4        59.4           59.4      4:30
A-8448 0           25.6820       49.6583        33.084543        59.3        59.3           59.3      1:20
       1           50.0283       62.0610        56.053144        59.3        59.3           59.3      0:38
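
The question asks for an hh:mm:ss column, whereas the ts_delta strings above are unpadded hours and minutes. A minimal sketch of a zero-padded formatter follows; `format_hms` is a hypothetical helper name (not part of pandas), the sample values are made up, and non-negative durations are assumed.

```python
import pandas as pd

def format_hms(delta: pd.Timedelta) -> str:
    """Format a non-negative Timedelta as zero-padded HH:MM:SS.

    Hours are not wrapped at 24, so a span of more than a day
    still shows its full length in hours.
    """
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Made-up durations, including one longer than a day.
deltas = pd.Series(pd.to_timedelta(["08:23:00", "00:38:00", "1 days 02:05:09"]))
formatted = deltas.apply(format_hms)
print(formatted.tolist())
```

Applied to the frame above, this would be used as `ts['ts_delta'].apply(format_hms)` in place of the lambda, before the column is turned into text.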

First, let's make sure 'time' is actually of datetime type and not str:

dfA['time'] = pd.to_datetime(dfA['time'])
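
As a quick check of why the conversion matters, here is a small sketch on made-up timestamps in the same format as the dataset. It only illustrates that the converted column is recognised as datetime and that subtracting two of its values yields a Timedelta.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up timestamps, stored as plain text like a freshly read CSV column.
raw = pd.Series(["2020-09-15 10:30:00", "2020-09-15 18:40:00"])
converted = pd.to_datetime(raw)

print(is_datetime64_any_dtype(raw))        # text column: not a datetime dtype
print(is_datetime64_any_dtype(converted))  # parsed column: datetime dtype

# Subtraction is only meaningful on the parsed values.
span = converted.max() - converted.min()
print(span)
```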

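Then we can apply the max and min aggregators to time. Since we cannot perform np.min on the datetime column, we replace it with min, and likewise for max. However, there is no average that works for both floats and datetimes, so we need to be more specific about which aggregation functions are applied to which columns. Replace your groupby with: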

teste = dfA.groupby(['Class1','flagA']).agg({'time':[min, max], 'Var1':[min, max, np.average], 'Y1':[min, max, np.average]})

Then we can use the maximum and minimum times to calculate the accumulated time:

teste['accumulated time'] = teste[('time','max')] - teste[('time','min')]

We get the following result, which I think is what you want (or close enough):

|               | ('time', 'min')     | ('time', 'max')     |   ('Var1', 'min') |   ('Var1', 'max') |   ('Var1', 'average') |   ('Y1', 'min') |   ('Y1', 'max') |   ('Y1', 'average') | ('accumulated time', '')   |
|:--------------|:--------------------|:--------------------|------------------:|------------------:|----------------------:|----------------:|----------------:|--------------------:|:---------------------------|
| ('A-8444', 0) | 2020-09-15 11:28:00 | 2020-09-15 19:51:00 |           26.6498 |           49.849  |               34.3713 |            59.6 |            59.6 |                59.6 | 0 days 08:23:00            |
| ('A-8444', 1) | 2020-09-15 10:30:00 | 2020-09-15 18:40:00 |           50.0507 |           67.0296 |               63.7224 |            59.6 |            59.6 |                59.6 | 0 days 08:10:00            |
| ('A-8445', 0) | 2020-09-15 19:52:00 | 2020-09-16 02:30:00 |           27.075  |           49.8547 |               36.5904 |            59.7 |            59.7 |                59.7 | 0 days 06:38:00            |
| ('A-8445', 1) | 2020-09-15 21:17:00 | 2020-09-16 02:01:00 |           50.1771 |           67.0874 |               63.5622 |            59.7 |            59.7 |                59.7 | 0 days 04:44:00            |
| ('A-8446', 0) | 2020-09-16 02:31:00 | 2020-09-16 03:59:00 |           26.2272 |           49.4617 |               33.0051 |            59.4 |            59.4 |                59.4 | 0 days 01:28:00            |
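
One thing to keep in mind: in the table above, the two flagA groups of the same Class1 overlap in time (for A-8444, flagA 0 runs from 11:28 to 19:51 while flagA 1 runs from 10:30 to 18:40). So max minus min gives the span between a group's first and last sample, which is not necessarily the time actually spent in that state. If the latter is what is wanted, one possible approach is to count the samples per group and multiply by the sampling interval. The sketch below uses made-up data and assumes regular one-minute sampling, as the printed rows suggest; the output column names (`span`, `time_in_state`, and so on) are illustrative only.

```python
import pandas as pd

# Made-up data: one row per minute, with Var1 crossing the threshold of 50 several times.
toy = pd.DataFrame({
    "time": pd.date_range("2020-09-15 10:30", periods=6, freq="min"),
    "Var1": [66.3, 49.0, 48.5, 51.2, 47.9, 52.4],
    "Class1": ["A-8444"] * 6,
})
toy["flagA"] = (toy["Var1"] >= 50).astype(int)

# Named aggregation yields flat column names instead of tuple labels.
summary = toy.groupby(["Class1", "flagA"]).agg(
    var1_min=("Var1", "min"),
    var1_max=("Var1", "max"),
    var1_mean=("Var1", "mean"),
    t_first=("time", "min"),
    t_last=("time", "max"),
    n_samples=("time", "count"),
)

# Span between the first and last sample of each group (the approach shown above).
summary["span"] = summary["t_last"] - summary["t_first"]

# Assumption: every row stands for one sampling interval of one minute.
sampling_interval = pd.Timedelta(minutes=1)
summary["time_in_state"] = summary["n_samples"] * sampling_interval

print(summary[["span", "time_in_state"]])
```

In this toy frame, flagA 1 has a span of five minutes but only three samples, so the two measures differ whenever a group is interrupted. Which one counts as "accumulated time" depends on what the question intends.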