Pandas - Bin rows based on time interval

I have the following DataFrame:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464
1  2022-01-01 00:52:15  1.0  0.823504
2  2022-01-01 02:00:01  1.0  0.807617
3  2022-01-01 02:37:14  1.0  0.154851
4  2022-01-01 04:44:46  1.0  0.049817
5  2022-01-01 00:03:06  2.0  0.538565
6  2022-01-01 00:04:05  2.0  0.332919
7  2022-01-01 00:04:24  2.0  0.106591
8  2022-01-01 00:05:41  2.0  0.552562
9  2022-01-01 00:05:58  2.0  0.851130
10 2022-01-01 00:06:58  2.0  0.400711
11 2022-01-01 00:08:43  2.0  0.840532
12 2022-01-01 00:08:52  2.0  0.184425
13 2022-01-01 00:12:52  2.0  0.956525
14 2022-01-01 00:15:52  2.0  0.403509
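
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a convenience sketch: the values are copied from the table, and parsing `timestamp` with `pd.to_datetime` is an assumption about the intended dtype.

```python
import pandas as pd

# Sample data copied from the table above; timestamps parsed to datetime64.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 00:37:34", "2022-01-01 00:52:15", "2022-01-01 02:00:01",
        "2022-01-01 02:37:14", "2022-01-01 04:44:46", "2022-01-01 00:03:06",
        "2022-01-01 00:04:05", "2022-01-01 00:04:24", "2022-01-01 00:05:41",
        "2022-01-01 00:05:58", "2022-01-01 00:06:58", "2022-01-01 00:08:43",
        "2022-01-01 00:08:52", "2022-01-01 00:12:52", "2022-01-01 00:15:52",
    ]),
    "id": [1.0] * 5 + [2.0] * 10,
    "val": [
        0.128464, 0.823504, 0.807617, 0.154851, 0.049817,
        0.538565, 0.332919, 0.106591, 0.552562, 0.851130,
        0.400711, 0.840532, 0.184425, 0.956525, 0.403509,
    ],
})

print(df)
```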

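I am trying to bin the values of the rows that fall within each 5-minute interval, and to add the missing rows wherever the gap exceeds 5 minutes, like this: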

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
1  2022-01-01 00:52:15  1.0  0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2  2022-01-01 02:00:01  1.0  0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3  2022-01-01 02:37:14  1.0  0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4  2022-01-01 04:44:46  1.0  0.049817 - val

              New Group
-------------------------------------
5  2022-01-01 00:03:06  2.0  0.538565 \
6  2022-01-01 00:04:05  2.0  0.332919 |
7  2022-01-01 00:04:24  2.0  0.106591 |_ val mean
8  2022-01-01 00:05:41  2.0  0.552562 |
9  2022-01-01 00:05:58  2.0  0.851130 |
10 2022-01-01 00:06:58  2.0  0.400711 /
-------------------------------------
11 2022-01-01 00:08:43  2.0  0.840532 \
12 2022-01-01 00:08:52  2.0  0.184425 |_ val mean
13 2022-01-01 00:12:52  2.0  0.956525 /
-------------------------------------
14 2022-01-01 00:15:52  2.0  0.403509 - val

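So the resulting frame would contain the 5-minute means of val, plus rows with val 0 for the intervals in which no activity occurred. I tried using pd.Grouper(key="timestamp", freq='5min', origin='start') to get the 5-minute intervals, but I'm not sure where to go from there.

Any help would be appreciated.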

Is this what you're looking for:

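def process(sdf):
    # Resample one id's rows into 5-minute bins anchored at its first timestamp,
    # then fill the empty bins: restore the id and set val to 0.
    return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
               .mean().fillna({"id": sdf.name, "val": 0}))

# Apply per id, drop the group level, and move the timestamp index back to a column.
df = (df.groupby("id", as_index=False).apply(process)
        .droplevel(level=0, axis=0).reset_index(drop=False))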

Result:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464
1  2022-01-01 00:42:34  1.0  0.000000
2  2022-01-01 00:47:34  1.0  0.823504
3  2022-01-01 00:52:34  1.0  0.000000
4  2022-01-01 00:57:34  1.0  0.000000
5  2022-01-01 01:02:34  1.0  0.000000
6  2022-01-01 01:07:34  1.0  0.000000
7  2022-01-01 01:12:34  1.0  0.000000
8  2022-01-01 01:17:34  1.0  0.000000
9  2022-01-01 01:22:34  1.0  0.000000
10 2022-01-01 01:27:34  1.0  0.000000
11 2022-01-01 01:32:34  1.0  0.000000
12 2022-01-01 01:37:34  1.0  0.000000
13 2022-01-01 01:42:34  1.0  0.000000
14 2022-01-01 01:47:34  1.0  0.000000
15 2022-01-01 01:52:34  1.0  0.000000
16 2022-01-01 01:57:34  1.0  0.807617
17 2022-01-01 02:02:34  1.0  0.000000
18 2022-01-01 02:07:34  1.0  0.000000
19 2022-01-01 02:12:34  1.0  0.000000
20 2022-01-01 02:17:34  1.0  0.000000
21 2022-01-01 02:22:34  1.0  0.000000
22 2022-01-01 02:27:34  1.0  0.000000
23 2022-01-01 02:32:34  1.0  0.154851
24 2022-01-01 02:37:34  1.0  0.000000
25 2022-01-01 02:42:34  1.0  0.000000
26 2022-01-01 02:47:34  1.0  0.000000
27 2022-01-01 02:52:34  1.0  0.000000
28 2022-01-01 02:57:34  1.0  0.000000
29 2022-01-01 03:02:34  1.0  0.000000
30 2022-01-01 03:07:34  1.0  0.000000
31 2022-01-01 03:12:34  1.0  0.000000
32 2022-01-01 03:17:34  1.0  0.000000
33 2022-01-01 03:22:34  1.0  0.000000
34 2022-01-01 03:27:34  1.0  0.000000
35 2022-01-01 03:32:34  1.0  0.000000
36 2022-01-01 03:37:34  1.0  0.000000
37 2022-01-01 03:42:34  1.0  0.000000
38 2022-01-01 03:47:34  1.0  0.000000
39 2022-01-01 03:52:34  1.0  0.000000
40 2022-01-01 03:57:34  1.0  0.000000
41 2022-01-01 04:02:34  1.0  0.000000
42 2022-01-01 04:07:34  1.0  0.000000
43 2022-01-01 04:12:34  1.0  0.000000
44 2022-01-01 04:17:34  1.0  0.000000
45 2022-01-01 04:22:34  1.0  0.000000
46 2022-01-01 04:27:34  1.0  0.000000
47 2022-01-01 04:32:34  1.0  0.000000
48 2022-01-01 04:37:34  1.0  0.000000
49 2022-01-01 04:42:34  1.0  0.049817
50 2022-01-01 00:03:06  2.0  0.463746
51 2022-01-01 00:08:06  2.0  0.660494
52 2022-01-01 00:13:06  2.0  0.403509

But I don't understand this requirement:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
1  2022-01-01 00:52:15  1.0  0.823504 /

Those two timestamps aren't within a 5-minute interval of each other, are they?
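
A quick check of the gap between those two rows, as a minimal sketch (the names `first`, `second`, `gap`, and `within` are just for illustration):

```python
import pandas as pd

# The two id 1.0 rows that the question brackets together as one "val mean".
first = pd.Timestamp("2022-01-01 00:37:34")
second = pd.Timestamp("2022-01-01 00:52:15")

gap = second - first
within = gap <= pd.Timedelta("5min")

print(gap)     # 0 days 00:14:41
print(within)  # False
```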

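IIUC, you need two different groupings: one that splits whenever the difference between consecutive rows exceeds 5 minutes, and one that forms 5-minute chunks counted from the start of the group: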

import pandas as pd

# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# split if difference exceeds 5 min
group = df.groupby('id')['timestamp'].diff().gt('5min').cumsum()

# group by 5 min chunks
g = pd.Grouper(key="timestamp", freq='5min', origin='start')

# aggregate
(df
 .groupby(['id', group, g], as_index=False)
 .agg({'timestamp': 'first', 'val': 'mean'})
)

Output:

    id           timestamp       val
0  1.0 2022-01-01 00:37:34  0.128464
1  1.0 2022-01-01 00:52:15  0.823504
2  1.0 2022-01-01 02:00:01  0.807617
3  1.0 2022-01-01 02:37:14  0.154851
4  1.0 2022-01-01 04:44:46  0.049817
5  2.0 2022-01-01 00:03:06  0.463746
6  2.0 2022-01-01 00:08:43  0.660494
7  2.0 2022-01-01 00:15:52  0.403509
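
To make the split key from the answer above easier to follow, here is a minimal sketch that computes it step by step on five rows taken from the sample data. The intermediate names `gap`, `exceeds`, and `run` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Five rows from the sample data: two for id 1.0, three for id 2.0.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 00:37:34", "2022-01-01 00:52:15",
        "2022-01-01 00:03:06", "2022-01-01 00:04:05", "2022-01-01 00:04:24",
    ]),
    "id": [1.0, 1.0, 2.0, 2.0, 2.0],
})

# Time since the previous row of the same id (NaT for the first row of each id).
gap = df.groupby("id")["timestamp"].diff()

# True where a new run starts; comparisons against NaT evaluate to False.
exceeds = gap.gt(pd.Timedelta("5min"))

# Running count of run starts, used as a group label.
run = exceeds.cumsum()

print(pd.DataFrame({"id": df["id"], "gap": gap, "exceeds": exceeds, "run": run}))
```

Note that the cumulative sum runs over the whole frame rather than per id, so a label such as 1 can appear under more than one id. That is presumably why the answer groups by `id` together with this key instead of by the key alone.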