Pandas - 基于时间间隔的 Bin 行
Pandas - Bin rows based on time interval
有以下DF:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:52:15 1.0 0.823504
2 2022-01-01 02:00:01 1.0 0.807617
3 2022-01-01 02:37:14 1.0 0.154851
4 2022-01-01 04:44:46 1.0 0.049817
5 2022-01-01 00:03:06 2.0 0.538565
6 2022-01-01 00:04:05 2.0 0.332919
7 2022-01-01 00:04:24 2.0 0.106591
8 2022-01-01 00:05:41 2.0 0.552562
9 2022-01-01 00:05:58 2.0 0.851130
10 2022-01-01 00:06:58 2.0 0.400711
11 2022-01-01 00:08:43 2.0 0.840532
12 2022-01-01 00:08:52 2.0 0.184425
13 2022-01-01 00:12:52 2.0 0.956525
14 2022-01-01 00:15:52 2.0 0.403509
我正在尝试对 5 分钟间隔内每一行的值进行分箱,并添加间隔超过 5 分钟的缺失行,如下所示:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2 2022-01-01 02:00:01 1.0 0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3 2022-01-01 02:37:14 1.0 0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4 2022-01-01 04:44:46 1.0 0.049817 - val
New Group
-------------------------------------
5 2022-01-01 00:03:06 2.0 0.538565 \
6 2022-01-01 00:04:05 2.0 0.332919 |
7 2022-01-01 00:04:24 2.0 0.106591 |_ val mean
8 2022-01-01 00:05:41 2.0 0.552562 |
9 2022-01-01 00:05:58 2.0 0.851130 |
10 2022-01-01 00:06:58 2.0 0.400711 /
-------------------------------------
11 2022-01-01 00:08:43 2.0 0.840532 \
12 2022-01-01 00:08:52 2.0 0.184425 |_ val mean
13 2022-01-01 00:12:52 2.0 0.956525 /
-------------------------------------
14 2022-01-01 00:15:52 2.0 0.403509 - val
因此,生成的帧将包含 5 分钟的 val 平均值,以及没有 activity 发生的具有 0 val 的行。我尝试使用 pd.Grouper(key="timestamp", freq='5min', origin='start')
来获得 5 分钟的间隔,但我不确定下一步该从哪里开始。
如有任何帮助,我们将不胜感激。
这是您要找的吗:
def process(sdf):
return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
.mean().fillna({"id": sdf.name, "val": 0}))
df = (df.groupby("id", as_index=False).apply(process)
.droplevel(level=0, axis=0).reset_index(drop=False))
结果:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:42:34 1.0 0.000000
2 2022-01-01 00:47:34 1.0 0.823504
3 2022-01-01 00:52:34 1.0 0.000000
4 2022-01-01 00:57:34 1.0 0.000000
5 2022-01-01 01:02:34 1.0 0.000000
6 2022-01-01 01:07:34 1.0 0.000000
7 2022-01-01 01:12:34 1.0 0.000000
8 2022-01-01 01:17:34 1.0 0.000000
9 2022-01-01 01:22:34 1.0 0.000000
10 2022-01-01 01:27:34 1.0 0.000000
11 2022-01-01 01:32:34 1.0 0.000000
12 2022-01-01 01:37:34 1.0 0.000000
13 2022-01-01 01:42:34 1.0 0.000000
14 2022-01-01 01:47:34 1.0 0.000000
15 2022-01-01 01:52:34 1.0 0.000000
16 2022-01-01 01:57:34 1.0 0.807617
17 2022-01-01 02:02:34 1.0 0.000000
18 2022-01-01 02:07:34 1.0 0.000000
19 2022-01-01 02:12:34 1.0 0.000000
20 2022-01-01 02:17:34 1.0 0.000000
21 2022-01-01 02:22:34 1.0 0.000000
22 2022-01-01 02:27:34 1.0 0.000000
23 2022-01-01 02:32:34 1.0 0.154851
24 2022-01-01 02:37:34 1.0 0.000000
25 2022-01-01 02:42:34 1.0 0.000000
26 2022-01-01 02:47:34 1.0 0.000000
27 2022-01-01 02:52:34 1.0 0.000000
28 2022-01-01 02:57:34 1.0 0.000000
29 2022-01-01 03:02:34 1.0 0.000000
30 2022-01-01 03:07:34 1.0 0.000000
31 2022-01-01 03:12:34 1.0 0.000000
32 2022-01-01 03:17:34 1.0 0.000000
33 2022-01-01 03:22:34 1.0 0.000000
34 2022-01-01 03:27:34 1.0 0.000000
35 2022-01-01 03:32:34 1.0 0.000000
36 2022-01-01 03:37:34 1.0 0.000000
37 2022-01-01 03:42:34 1.0 0.000000
38 2022-01-01 03:47:34 1.0 0.000000
39 2022-01-01 03:52:34 1.0 0.000000
40 2022-01-01 03:57:34 1.0 0.000000
41 2022-01-01 04:02:34 1.0 0.000000
42 2022-01-01 04:07:34 1.0 0.000000
43 2022-01-01 04:12:34 1.0 0.000000
44 2022-01-01 04:17:34 1.0 0.000000
45 2022-01-01 04:22:34 1.0 0.000000
46 2022-01-01 04:27:34 1.0 0.000000
47 2022-01-01 04:32:34 1.0 0.000000
48 2022-01-01 04:37:34 1.0 0.000000
49 2022-01-01 04:42:34 1.0 0.049817
50 2022-01-01 00:03:06 2.0 0.463746
51 2022-01-01 00:08:06 2.0 0.660494
52 2022-01-01 00:13:06 2.0 0.403509
但是我不明白这个要求:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
这 2 个时间戳不在 5 分钟间隔内?
IIUC,你需要2种不同的分组方式。一种根据超过 5 分钟的连续差异进行拆分,一种从组开始生成 5 分钟的块:
# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# split if difference exceeds 5 min
group = df.groupby('id')['timestamp'].diff().gt('5min').cumsum()
# group by 5 min chunks
g = pd.Grouper(key="timestamp", freq='5min', origin='start')
# aggregate
(df
.groupby(['id', group, g], as_index=False)
.agg({'timestamp': 'first', 'val': 'mean'})
)
输出:
id timestamp val
0 1.0 2022-01-01 00:37:34 0.128464
1 1.0 2022-01-01 00:52:15 0.823504
2 1.0 2022-01-01 02:00:01 0.807617
3 1.0 2022-01-01 02:37:14 0.154851
4 1.0 2022-01-01 04:44:46 0.049817
5 2.0 2022-01-01 00:03:06 0.463746
6 2.0 2022-01-01 00:08:43 0.660494
7 2.0 2022-01-01 00:15:52 0.403509
有以下DF:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:52:15 1.0 0.823504
2 2022-01-01 02:00:01 1.0 0.807617
3 2022-01-01 02:37:14 1.0 0.154851
4 2022-01-01 04:44:46 1.0 0.049817
5 2022-01-01 00:03:06 2.0 0.538565
6 2022-01-01 00:04:05 2.0 0.332919
7 2022-01-01 00:04:24 2.0 0.106591
8 2022-01-01 00:05:41 2.0 0.552562
9 2022-01-01 00:05:58 2.0 0.851130
10 2022-01-01 00:06:58 2.0 0.400711
11 2022-01-01 00:08:43 2.0 0.840532
12 2022-01-01 00:08:52 2.0 0.184425
13 2022-01-01 00:12:52 2.0 0.956525
14 2022-01-01 00:15:52 2.0 0.403509
我正在尝试对 5 分钟间隔内每一行的值进行分箱,并添加间隔超过 5 分钟的缺失行,如下所示:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2 2022-01-01 02:00:01 1.0 0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3 2022-01-01 02:37:14 1.0 0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4 2022-01-01 04:44:46 1.0 0.049817 - val
New Group
-------------------------------------
5 2022-01-01 00:03:06 2.0 0.538565 \
6 2022-01-01 00:04:05 2.0 0.332919 |
7 2022-01-01 00:04:24 2.0 0.106591 |_ val mean
8 2022-01-01 00:05:41 2.0 0.552562 |
9 2022-01-01 00:05:58 2.0 0.851130 |
10 2022-01-01 00:06:58 2.0 0.400711 /
-------------------------------------
11 2022-01-01 00:08:43 2.0 0.840532 \
12 2022-01-01 00:08:52 2.0 0.184425 |_ val mean
13 2022-01-01 00:12:52 2.0 0.956525 /
-------------------------------------
14 2022-01-01 00:15:52 2.0 0.403509 - val
因此,生成的帧将包含 5 分钟的 val 平均值,以及没有 activity 发生的具有 0 val 的行。我尝试使用 pd.Grouper(key="timestamp", freq='5min', origin='start')
来获得 5 分钟的间隔,但我不确定下一步该从哪里开始。
如有任何帮助,我们将不胜感激。
这是您要找的吗:
def process(sdf):
return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
.mean().fillna({"id": sdf.name, "val": 0}))
df = (df.groupby("id", as_index=False).apply(process)
.droplevel(level=0, axis=0).reset_index(drop=False))
结果:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464
1 2022-01-01 00:42:34 1.0 0.000000
2 2022-01-01 00:47:34 1.0 0.823504
3 2022-01-01 00:52:34 1.0 0.000000
4 2022-01-01 00:57:34 1.0 0.000000
5 2022-01-01 01:02:34 1.0 0.000000
6 2022-01-01 01:07:34 1.0 0.000000
7 2022-01-01 01:12:34 1.0 0.000000
8 2022-01-01 01:17:34 1.0 0.000000
9 2022-01-01 01:22:34 1.0 0.000000
10 2022-01-01 01:27:34 1.0 0.000000
11 2022-01-01 01:32:34 1.0 0.000000
12 2022-01-01 01:37:34 1.0 0.000000
13 2022-01-01 01:42:34 1.0 0.000000
14 2022-01-01 01:47:34 1.0 0.000000
15 2022-01-01 01:52:34 1.0 0.000000
16 2022-01-01 01:57:34 1.0 0.807617
17 2022-01-01 02:02:34 1.0 0.000000
18 2022-01-01 02:07:34 1.0 0.000000
19 2022-01-01 02:12:34 1.0 0.000000
20 2022-01-01 02:17:34 1.0 0.000000
21 2022-01-01 02:22:34 1.0 0.000000
22 2022-01-01 02:27:34 1.0 0.000000
23 2022-01-01 02:32:34 1.0 0.154851
24 2022-01-01 02:37:34 1.0 0.000000
25 2022-01-01 02:42:34 1.0 0.000000
26 2022-01-01 02:47:34 1.0 0.000000
27 2022-01-01 02:52:34 1.0 0.000000
28 2022-01-01 02:57:34 1.0 0.000000
29 2022-01-01 03:02:34 1.0 0.000000
30 2022-01-01 03:07:34 1.0 0.000000
31 2022-01-01 03:12:34 1.0 0.000000
32 2022-01-01 03:17:34 1.0 0.000000
33 2022-01-01 03:22:34 1.0 0.000000
34 2022-01-01 03:27:34 1.0 0.000000
35 2022-01-01 03:32:34 1.0 0.000000
36 2022-01-01 03:37:34 1.0 0.000000
37 2022-01-01 03:42:34 1.0 0.000000
38 2022-01-01 03:47:34 1.0 0.000000
39 2022-01-01 03:52:34 1.0 0.000000
40 2022-01-01 03:57:34 1.0 0.000000
41 2022-01-01 04:02:34 1.0 0.000000
42 2022-01-01 04:07:34 1.0 0.000000
43 2022-01-01 04:12:34 1.0 0.000000
44 2022-01-01 04:17:34 1.0 0.000000
45 2022-01-01 04:22:34 1.0 0.000000
46 2022-01-01 04:27:34 1.0 0.000000
47 2022-01-01 04:32:34 1.0 0.000000
48 2022-01-01 04:37:34 1.0 0.000000
49 2022-01-01 04:42:34 1.0 0.049817
50 2022-01-01 00:03:06 2.0 0.463746
51 2022-01-01 00:08:06 2.0 0.660494
52 2022-01-01 00:13:06 2.0 0.403509
但是我不明白这个要求:
timestamp id val
0 2022-01-01 00:37:34 1.0 0.128464 \_ val mean
1 2022-01-01 00:52:15 1.0 0.823504 /
这 2 个时间戳不在 5 分钟间隔内?
IIUC,你需要2种不同的分组方式。一种根据超过 5 分钟的连续差异进行拆分,一种从组开始生成 5 分钟的块:
# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# split if difference exceeds 5 min
group = df.groupby('id')['timestamp'].diff().gt('5min').cumsum()
# group by 5 min chunks
g = pd.Grouper(key="timestamp", freq='5min', origin='start')
# aggregate
(df
.groupby(['id', group, g], as_index=False)
.agg({'timestamp': 'first', 'val': 'mean'})
)
输出:
id timestamp val
0 1.0 2022-01-01 00:37:34 0.128464
1 1.0 2022-01-01 00:52:15 0.823504
2 1.0 2022-01-01 02:00:01 0.807617
3 1.0 2022-01-01 02:37:14 0.154851
4 1.0 2022-01-01 04:44:46 0.049817
5 2.0 2022-01-01 00:03:06 0.463746
6 2.0 2022-01-01 00:08:43 0.660494
7 2.0 2022-01-01 00:15:52 0.403509