How to calculate the accumulated time of a groupby function using pandas?
Hi everyone! I have the following dataset (https://pastebin.com/697NsZXk):
dfA
Out[83]:
time Var1 Y1 Class1 flagA
2070 2020-09-15 10:30:00 66.3260 59.6 A-8444 1
2071 2020-09-15 10:31:00 66.2881 59.6 A-8444 1
2072 2020-09-15 10:32:00 66.2570 59.6 A-8444 1
2073 2020-09-15 10:33:00 66.2364 59.6 A-8444 1
2074 2020-09-15 10:34:00 66.2511 59.6 A-8444 1
2075 2020-09-15 10:35:00 66.2478 59.6 A-8444 1
2076 2020-09-15 10:36:00 66.2571 59.6 A-8444 1
2077 2020-09-15 10:37:00 66.2645 59.6 A-8444 1
2078 2020-09-15 10:38:00 66.2233 59.6 A-8444 1
2079 2020-09-15 10:39:00 66.2132 59.6 A-8444 1
... ... ... ... ...
3501 2020-09-16 10:21:00 58.8167 59.3 A-8448 1
3502 2020-09-16 10:22:00 59.1132 59.3 A-8448 1
3503 2020-09-16 10:23:00 59.4533 59.3 A-8448 1
3504 2020-09-16 10:24:00 59.7931 59.3 A-8448 1
3505 2020-09-16 10:25:00 60.1398 59.3 A-8448 1
3506 2020-09-16 10:26:00 60.5043 59.3 A-8448 1
3507 2020-09-16 10:27:00 60.8606 59.3 A-8448 1
3508 2020-09-16 10:28:00 61.2513 59.3 A-8448 1
3509 2020-09-16 10:29:00 61.6430 59.3 A-8448 1
3510 2020-09-16 10:30:00 62.0610 59.3 A-8448 1
[1441 rows x 5 columns]
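I want to calculate the mean, minimum, and maximum of Var1 and Y1, grouped by ['Class1', 'flagA']. I can do that with the code below, but I also want to calculate the accumulated time covered by each group. The result I currently get is: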
Var1 Y1
amin amax average amin amax average
Class1 flagA
A-8444 0 26.6498 49.8490 34.371305 59.6 59.6 59.6
1 50.0507 67.0296 63.722390 59.6 59.6 59.6
A-8445 0 27.0750 49.8547 36.590446 59.7 59.7 59.7
1 50.1771 67.0874 63.562250 59.7 59.7 59.7
A-8446 0 26.2272 49.4617 33.005098 59.4 59.4 59.4
1 50.2412 67.1156 63.853893 59.4 59.4 59.4
A-8448 0 25.6820 49.6583 33.084543 59.3 59.3 59.3
1 50.0283 62.0610 56.053144 59.3 59.3 59.3
But I also need another column showing how much time each group represents. Any ideas? It should look something like this:
Var1 Y1
amin amax average amin amax average **accumulated time**
Class1 flagA
A-8444 0 26.6498 49.8490 34.371305 59.6 59.6 59.6 **hh:mm:ss**
1 50.0507 67.0296 63.722390 59.6 59.6 59.6 **hh:mm:ss**
A-8445 0 27.0750 49.8547 36.590446 59.7 59.7 59.7 **hh:mm:ss**
1 50.1771 67.0874 63.562250 59.7 59.7 59.7 **hh:mm:ss**
A-8446 0 26.2272 49.4617 33.005098 59.4 59.4 59.4 **hh:mm:ss**
1 50.2412 67.1156 63.853893 59.4 59.4 59.4 **hh:mm:ss**
A-8448 0 25.6820 49.6583 33.084543 59.3 59.3 59.3 **hh:mm:ss**
1 50.0283 62.0610 56.053144 59.3 59.3 59.3 **hh:mm:ss**
Current code:
# Creating flagA
conditions = [
    (dfA['Var1'] < 50),
    (dfA['Var1'] >= 50)
]
values = [0, 1]
dfA.loc[:, 'flagA'] = np.select(conditions, values)
# groupby to calculate min, max and average. Need to add something to calculate accumulated time.
teste = dfA.groupby(['Class1','flagA']).agg([np.min, np.max, np.average])
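Group the rows to find the minimum and maximum time of each group, then combine the two results into one data frame. The difference between the maximum and minimum time is then converted to a time format.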
import pandas as pd
import numpy as np
dfA = pd.read_csv('./Data/697NsZXk.csv', sep=',')
# Parse the timestamps so they can be subtracted
dfA['time'] = pd.to_datetime(dfA['time'])
# Aggregate only the numeric columns
teste = dfA.groupby(['Class1','flagA'])[['Var1','Y1']].agg([np.min, np.max, np.average])
# First and last timestamp of each group
ts_min = dfA.groupby(['Class1','flagA'])['time'].min()
ts_max = dfA.groupby(['Class1','flagA'])['time'].max()
ts = pd.concat([ts_min,ts_max], axis=1)
ts.columns = ['ts_min', 'ts_max']
ts['ts_delta'] = ts['ts_max'] - ts['ts_min']
final = pd.concat([teste, ts[['ts_delta']]], axis=1)
# Render the Timedelta as h:mm
final['ts_delta'] = final['ts_delta'].apply(lambda x: str(int(x.total_seconds() // 3600))+':'+ str(int(x.total_seconds() % 3600 // 60)))
final
(Var1, amin) (Var1, amax) (Var1, average) (Y1, amin) (Y1, amax) (Y1, average) ts_delta
Class1 flagA
A-8444 0 26.6498 49.8490 34.371305 59.6 59.6 59.6 8:23
1 50.0507 67.0296 63.722390 59.6 59.6 59.6 8:10
A-8445 0 27.0750 49.8547 36.590446 59.7 59.7 59.7 6:38
1 50.1771 67.0874 63.562250 59.7 59.7 59.7 4:44
A-8446 0 26.2272 49.4617 33.005098 59.4 59.4 59.4 1:28
1 50.2412 67.1156 63.853893 59.4 59.4 59.4 4:30
A-8448 0 25.6820 49.6583 33.084543 59.3 59.3 59.3 1:20
1 50.0283 62.0610 56.053144 59.3 59.3 59.3 0:38
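The question asks for hh:mm:ss, while the lambda above produces unpadded h:mm strings. A minimal sketch of a zero-padded formatter follows; the helper name `format_hhmmss` and the sample durations are made up for illustration, and the helper assumes non-negative durations.

```python
import pandas as pd

def format_hhmmss(delta: pd.Timedelta) -> str:
    """Render a non-negative Timedelta as zero-padded hh:mm:ss.

    Hours are not wrapped at 24, so a duration longer than a day
    shows up as e.g. 26:00:05.
    """
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Made-up durations, similar in shape to the ts_delta column
deltas = pd.Series(pd.to_timedelta(["0 days 08:23:00", "0 days 00:38:00", "1 days 02:00:05"]))
formatted = deltas.apply(format_hhmmss)
print(formatted.tolist())  # ['08:23:00', '00:38:00', '26:00:05']
```

With the frame from this answer, the same helper could be applied as `final['ts_delta'].apply(format_hhmmss)` while the column still holds Timedelta values.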
First, let's make sure 'time' is actually of datetime type and not str:
dfA['time'] = pd.to_datetime(dfA['time'])
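Then we can apply the max and min aggregators to time. Since we cannot run np.min on datetimes, we can use min instead, and likewise for max. But there is no average that works for both floats and datetimes, so we need to be more specific about which aggregation functions apply to which columns. Replace your groupby with: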
teste = dfA.groupby(['Class1','flagA']).agg({'time':[min, max], 'Var1':[min, max, np.average], 'Y1':[min, max, np.average]})
Then we can use the maximum and minimum times to calculate the accumulated time:
teste['accumulated time'] = teste[('time','max')] - teste[('time','min')]
This gives the following result, which I think is what you want (or close enough):
| | ('time', 'min') | ('time', 'max') | ('Var1', 'min') | ('Var1', 'max') | ('Var1', 'average') | ('Y1', 'min') | ('Y1', 'max') | ('Y1', 'average') | ('accumulated time', '') |
|:--------------|:--------------------|:--------------------|------------------:|------------------:|----------------------:|----------------:|----------------:|--------------------:|:---------------------------|
| ('A-8444', 0) | 2020-09-15 11:28:00 | 2020-09-15 19:51:00 | 26.6498 | 49.849 | 34.3713 | 59.6 | 59.6 | 59.6 | 0 days 08:23:00 |
| ('A-8444', 1) | 2020-09-15 10:30:00 | 2020-09-15 18:40:00 | 50.0507 | 67.0296 | 63.7224 | 59.6 | 59.6 | 59.6 | 0 days 08:10:00 |
| ('A-8445', 0) | 2020-09-15 19:52:00 | 2020-09-16 02:30:00 | 27.075 | 49.8547 | 36.5904 | 59.7 | 59.7 | 59.7 | 0 days 06:38:00 |
| ('A-8445', 1) | 2020-09-15 21:17:00 | 2020-09-16 02:01:00 | 50.1771 | 67.0874 | 63.5622 | 59.7 | 59.7 | 59.7 | 0 days 04:44:00 |
| ('A-8446', 0) | 2020-09-16 02:31:00 | 2020-09-16 03:59:00 | 26.2272 | 49.4617 | 33.0051 | 59.4 | 59.4 | 59.4 | 0 days 01:28:00 |
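One caveat: `max - min` measures the span between the first and last timestamp of a group. In the table above, the two flagA groups of A-8444 overlap in time (10:30 to 18:40 and 11:28 to 19:51), which suggests the flag switches back and forth, so the span may overstate the time actually spent in each state. The sketch below compares the span with a per-run total. It uses a small made-up frame sampled once per minute; the `run_id` column is a hypothetical helper, not part of the original data.

```python
import pandas as pd

# Made-up sample: one class, flag goes 1 -> 0 -> 1, one row per minute
df = pd.DataFrame({
    "time": pd.date_range("2020-09-15 10:30", periods=9, freq="min"),
    "Class1": ["A-8444"] * 9,
    "flagA": [1, 1, 1, 0, 0, 0, 1, 1, 1],
}).sort_values("time")

keys = ["Class1", "flagA"]

# Span: last timestamp minus first timestamp of each group
grouped = df.groupby(keys)["time"]
span = grouped.max() - grouped.min()

# Accumulated time: number each uninterrupted run of identical keys,
# take the span of every run, then sum the run spans per group.
# A run of n samples contributes n - 1 sampling intervals.
changed = (df[keys] != df[keys].shift()).any(axis=1)
df["run_id"] = changed.cumsum()
runs = df.groupby(keys + ["run_id"])["time"]
run_span = runs.max() - runs.min()
accumulated = run_span.groupby(level=keys).sum()

print(span)         # flagA 1 spans 8 minutes, flagA 0 spans 2 minutes
print(accumulated)  # flagA 1 accumulates 4 minutes, flagA 0 accumulates 2 minutes
```

If every sample should count for a full sampling interval instead, an alternative under the same one-minute assumption is to multiply the row count of each group by the interval.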