Python Pandas Calculating timedelta between event occurrences
I have a Pandas (0.14.1) dataframe with a datetime and an event column, as follows:
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta

def perdelta(start, end, delta):
    # Yield timestamps from start (inclusive) to end (exclusive) in steps of delta.
    curr = start
    while curr < end:
        yield curr
        curr += delta

# A sparse event column: mostly NaN, with a few events.
events = [np.nan] * 20
events[5] = 20
events[12] = 3
events[15] = 10
n = len(events)

# A noisy, slowly rising signal.
signal = [i / 10.0 for i in range(n)] + np.random.randn(n)

df = pd.DataFrame({'level1': signal,
                   'event': events,
                   'datetime': list(perdelta(datetime.now(),
                                             datetime.now() + timedelta(minutes=10),
                                             timedelta(seconds=30)))},
                  index=range(n))
df.head(7)
datetime event level1
0 2016-07-14 10:44:47.035000 NaN 0.158594
1 2016-07-14 10:45:17.035000 NaN 0.282749
2 2016-07-14 10:45:47.035000 NaN 0.448012
3 2016-07-14 10:46:17.035000 NaN 0.590702
4 2016-07-14 10:46:47.035000 NaN -0.346073
5 2016-07-14 10:47:17.035000 20 0.072986
6 2016-07-14 10:47:47.035000 NaN 1.493900
I'd like to add a t_since_last_event column that gives, at each timestep, the time elapsed since the last event occurrence. The resulting df should look like this:
df
datetime event level1 t_since_last_event
0 2016-07-14 10:44:47.035000 NaN 0.158594 0
1 2016-07-14 10:45:17.035000 NaN 0.282749 30
2 2016-07-14 10:45:47.035000 NaN 0.448012 60
3 2016-07-14 10:46:17.035000 NaN 0.590702 90
4 2016-07-14 10:46:47.035000 NaN -0.346073 120
5 2016-07-14 10:47:17.035000 20 0.072986 0
6 2016-07-14 10:47:47.035000 NaN 1.493900 30
7 2016-07-14 10:48:17.035000 NaN -0.143081 60
8 2016-07-14 10:48:47.035000 NaN 0.173715 90
9 2016-07-14 10:49:17.035000 NaN 1.232040 120
10 2016-07-14 10:49:47.035000 NaN 3.497438 150
11 2016-07-14 10:50:17.035000 NaN 0.956582 180
12 2016-07-14 10:50:47.035000 3 2.976383 0
13 2016-07-14 10:51:17.035000 NaN 0.599698 30
14 2016-07-14 10:51:47.035000 NaN 2.538005 60
15 2016-07-14 10:52:17.035000 10 1.362104 0
16 2016-07-14 10:52:47.035000 NaN 2.224680 30
17 2016-07-14 10:53:17.035000 NaN 3.221037 60
18 2016-07-14 10:53:47.035000 NaN 1.869479 90
19 2016-07-14 10:54:17.035000 NaN 1.447430 120
Is there a clever way to do this in Pandas? It involves grouping horizontally (by event occurrence) and counting vertically, so the solution is not obvious to me. I've posted my pedestrian solution below.
Here is my pedestrian solution. I suspect there should be a faster Pandas way; the mix of vertical and horizontal dependencies makes apply(), groupby() and the like harder to use:
last_trade_time = df.iloc[0]['datetime']
t = [np.nan] * len(df)
for i, row in df.iterrows():
    if np.isnan(row['event']):
        # No event at this row: time elapsed since the last event (or since the first row).
        t[i] = row['datetime'] - last_trade_time
    else:
        # Event row: reset the counter and remember the event time.
        t[i] = 0
        last_trade_time = row['datetime']
df['t_since_last_event'] = t
Vectorization should be straightforward here:
- Add another column to hold the time of the last event.
- Where event is not NaN, set that column to the event time; otherwise leave it NaN.
- Fill the NaN values with the ffill method.
- Subtract this column from the datetime column.
This should work even with pandas 0.14.1:
mask = df['event'].notnull()
df['last_event_time'] = np.NaN
df.loc[mask, 'last_event_time'] = df.loc[mask, 'datetime']
df['last_event_time'] = df['last_event_time'].fillna(method='ffill')
df['t_since_last_event'] = df['datetime'] - df['last_event_time']
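Note that the subtraction leaves t_since_last_event as a timedelta column rather than the plain seconds shown in the desired output. A minimal sketch of one way to convert it, assuming numpy is imported as np (the division avoids the .dt accessor, which older pandas versions lack):

# Convert the timedelta64 column to floating-point seconds.
df['t_since_last_event'] = df['t_since_last_event'] / np.timedelta64(1, 's')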
You may also want to set the first element of event to zero right at the start; or alternatively, mask[0] = True.
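For completeness, a minimal sketch of that first-row adjustment, assuming the same mask and column names as above; either variant gives the forward fill a starting value for the rows before the first event:

mask = df['event'].notnull()
mask[0] = True  # treat the first row as an event so ffill has a starting value
# or, to the same effect, record a dummy event value up front:
# df.loc[0, 'event'] = 0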