Distance to multiple irregularly sampled events in pandas
I have an irregularly sampled time series:
event
Time
2013-01-01 01:40:53.072 n
2013-01-01 01:41:25.563 e
2013-01-01 01:51:23.293 e
2013-01-01 01:57:14.168 e
2013-01-01 01:58:07.273 e
2013-01-01 02:05:36.250 e
2013-01-01 02:35:08.501 e
2013-01-01 02:37:36.498 e
2013-01-01 03:22:15.091 e
2013-01-01 03:35:58.140 e
2013-01-01 03:39:47.682 e
2013-01-01 04:22:18.756 e
2013-01-01 04:33:08.892 e
2013-01-01 04:43:17.985 n
2013-01-01 04:49:49.281 e
2013-01-01 05:10:26.957 e
2013-01-01 05:17:15.411 e
2013-01-01 06:11:15.033 e
2013-01-01 06:46:36.406 e
2013-01-01 07:26:00.488 e
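I would like to calculate the cumulative elapsed time between successive 'n' events.
There is a similar question (Pandas time series time between events), but because my time index is irregular I could not adapt that solution to my problem. My attempt was to use df1['diff']=df1.groupby('event_bool')['event_time'].diff()
to obtain something like this: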
event event_bool diff
Time
2013-01-01 01:40:53.072 n True NaT
2013-01-01 01:41:25.563 e False NaT
2013-01-01 01:51:23.293 e False 00:09:57.730000
2013-01-01 01:57:14.168 e False 00:05:50.875000
2013-01-01 01:58:07.273 e False 00:00:53.105000
2013-01-01 02:05:36.250 e False 00:07:28.977000
2013-01-01 02:35:08.501 e False 00:29:32.251000
2013-01-01 02:37:36.498 e False 00:02:27.997000
2013-01-01 03:22:15.091 e False 00:44:38.593000
2013-01-01 03:35:58.140 e False 00:13:43.049000
2013-01-01 03:39:47.682 e False 00:03:49.542000
2013-01-01 04:22:18.756 e False 00:42:31.074000
2013-01-01 04:33:08.892 e False 00:10:50.136000
2013-01-01 04:43:17.985 n True NaT
2013-01-01 04:49:49.281 e False 00:16:40.389000
2013-01-01 05:10:26.957 e False 00:20:37.676000
2013-01-01 05:17:15.411 e False 00:06:48.454000
2013-01-01 06:11:15.033 e False 00:53:59.622000
2013-01-01 06:46:36.406 e False 00:35:21.373000
2013-01-01 07:26:00.488 e False 00:39:24.082000
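But I still have the following unresolved problems:
- The first 'e' event after an 'n' has a NaT. The result should be `00:00:32.491000`.
- How can I accumulate the elapsed time between 'n' events?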
I'm not sure what NaT is, but you can use a fill method to replace all the null values in the diff column, and then use the .sum() aggregation method.
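For reference, NaT ("not a time") is the missing-value marker pandas uses for datetime and timedelta data. A minimal sketch of the fill-then-sum suggestion follows, assuming the timestamps live in a regular column named Time and that each 'n' row starts a new segment. The frame is a five-row subset of the question's data, and the names `segment` and `total_per_segment` are made up for illustration.

```python
import pandas as pd

# Small illustrative frame (a subset of the question's data).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2013-01-01 01:40:53.072",
        "2013-01-01 01:41:25.563",
        "2013-01-01 01:51:23.293",
        "2013-01-01 04:43:17.985",
        "2013-01-01 04:49:49.281",
    ]),
    "event": ["n", "e", "e", "n", "e"],
})

# Gap to the previous row; blank it out (NaT) on every 'n' row,
# because an 'n' starts a new segment and its own gap should not count.
df["diff"] = df["Time"].diff().mask(df["event"] == "n")

# Replace the missing gaps with a zero-length Timedelta.
df["diff"] = df["diff"].fillna(pd.Timedelta(0))

# Label segments: the counter increases by one at every 'n' row.
segment = (df["event"] == "n").cumsum()

# Total elapsed time inside each segment.
total_per_segment = df.groupby(segment)["diff"].sum()
print(total_per_segment)
```

Note that .sum() yields one total per segment; a running total per row needs a cumulative sum instead.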
Let's try the following:
import pandas as pd
df = df.reset_index()
# Gap to the previous row, blanked out (NaT) on every 'n' row
df_out = pd.concat([df, df['Time'].diff().rename('diff').mask(df['event'] == 'n')], axis=1)
# Every 'n' starts a new group; accumulate the gaps within each group
df_out['cum diff'] = df_out.groupby((df_out.event == 'n').cumsum())['diff'].transform(lambda x: x.fillna(pd.Timedelta(0)).cumsum())
df_out = df_out.set_index('Time')
Updated output:
Time event diff cum diff
0 2013-01-01 01:40:53.072 n NaT 00:00:00
1 2013-01-01 01:41:25.563 e 00:00:32.491000 00:00:32.491000
2 2013-01-01 01:51:23.293 e 00:09:57.730000 00:10:30.221000
3 2013-01-01 01:57:14.168 e 00:05:50.875000 00:16:21.096000
4 2013-01-01 01:58:07.273 e 00:00:53.105000 00:17:14.201000
5 2013-01-01 02:05:36.250 e 00:07:28.977000 00:24:43.178000
6 2013-01-01 02:35:08.501 e 00:29:32.251000 00:54:15.429000
7 2013-01-01 02:37:36.498 e 00:02:27.997000 00:56:43.426000
8 2013-01-01 03:22:15.091 e 00:44:38.593000 01:41:22.019000
9 2013-01-01 03:35:58.140 e 00:13:43.049000 01:55:05.068000
10 2013-01-01 03:39:47.682 e 00:03:49.542000 01:58:54.610000
11 2013-01-01 04:22:18.756 e 00:42:31.074000 02:41:25.684000
12 2013-01-01 04:33:08.892 e 00:10:50.136000 02:52:15.820000
13 2013-01-01 04:43:17.985 n NaT 00:00:00
14 2013-01-01 04:49:49.281 e 00:06:31.296000 00:06:31.296000
15 2013-01-01 05:10:26.957 e 00:20:37.676000 00:27:08.972000
16 2013-01-01 05:17:15.411 e 00:06:48.454000 00:33:57.426000
17 2013-01-01 06:11:15.033 e 00:53:59.622000 01:27:57.048000
18 2013-01-01 06:46:36.406 e 00:35:21.373000 02:03:18.421000
19 2013-01-01 07:26:00.488 e 00:39:24.082000 02:42:42.503000
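The cum diff column above is, for each row, the time elapsed since the most recent 'n'. As an alternative sketch (not the method used in this answer), the same quantity can be obtained without groupby by carrying the timestamp of the last 'n' forward and subtracting it. The column name `since_n` and the five-row sample frame are assumptions chosen for illustration.

```python
import pandas as pd

# Small illustrative frame (a subset of the question's data).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2013-01-01 01:40:53.072",
        "2013-01-01 01:41:25.563",
        "2013-01-01 01:51:23.293",
        "2013-01-01 04:43:17.985",
        "2013-01-01 04:49:49.281",
    ]),
    "event": ["n", "e", "e", "n", "e"],
})

# Keep the timestamp only on 'n' rows; every other row becomes NaT.
n_times = df["Time"].where(df["event"] == "n")

# Carry the most recent 'n' timestamp forward onto the rows that follow it.
last_n = n_times.ffill()

# Elapsed time since the most recent 'n' (zero on the 'n' rows themselves).
df["since_n"] = df["Time"] - last_n
print(df)
```

Rows that come before the first 'n' would get NaT here, since there is no earlier 'n' to measure from.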
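At first I came up with a solution using a loop, as follows:
times = []
for index, row in df.iterrows():
    # Remember the timestamp of the most recent 'n' event
    if row['event'] == 'n':
        last = row['Time']
    # Elapsed time since that 'n' (assumes the first row is an 'n')
    times.append(row['Time'] - last)
df['TimeNew'] = times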
But then I saw the other answer, so I ran some tests to see which one performs better.
I ran each method 10 times; the average times were:
Lines | Loop method (s) | lambda method (s) |
------------------------------------------------
21    | 0.006838305     | 0.013882545       |
504   | 0.092648337     | 0.056006076       |
1000  | 0.169315854     | 0.097687499       |
10000 | 1.414376600     | 0.746927508       |
Execution time by method
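The answer posted here is indeed faster as the amount of data grows: from 504 rows up it beats the loop, which wins only on the 21-row sample.
Given how plain Python loops perform, that is not surprising.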
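For anyone who wants to repeat the comparison, here is a rough sketch of a timing harness built on the standard-library timeit module. It does not use the original test data: `make_frame` is a hypothetical helper that builds a synthetic frame with one row per minute and an 'n' every 20 rows, and the function names are made up for illustration. Absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def make_frame(n_rows, n_every=20):
    """Synthetic data: one row per minute, an 'n' event every n_every rows."""
    times = pd.date_range("2013-01-01", periods=n_rows, freq="min")
    events = ["n" if i % n_every == 0 else "e" for i in range(n_rows)]
    return pd.DataFrame({"Time": times, "event": events})


def loop_method(df):
    """Row-by-row version, following the loop shown above."""
    times = []
    last = None
    for _, row in df.iterrows():
        if row["event"] == "n":
            last = row["Time"]
        times.append(row["Time"] - last)
    return pd.Series(times, index=df.index)


def lambda_method(df):
    """Grouped cumulative sum of the row-to-row gaps."""
    gaps = df["Time"].diff().mask(df["event"] == "n").fillna(pd.Timedelta(0))
    segment = (df["event"] == "n").cumsum()
    return gaps.groupby(segment).transform(lambda x: x.cumsum())


for n_rows in (21, 504, 1000):
    frame = make_frame(n_rows)
    # Average over 10 runs, as in the measurements above.
    t_loop = timeit.timeit(lambda: loop_method(frame), number=10) / 10
    t_lambda = timeit.timeit(lambda: lambda_method(frame), number=10) / 10
    print(f"{n_rows:>6} rows  loop {t_loop:.6f} s  lambda {t_lambda:.6f} s")
```

Both functions return the elapsed time since the latest 'n' for every row, so their outputs can be compared for equality before timing them.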