Pandas - Create a 10-min time series from dataframe of events with start and end date and time
I have a dataframe df1 of events with start and end dates:
   ID   EventID  Start                    End                      Duration
0  G01  1001     2017-10-16 06:03:37.440  2017-10-16 06:24:24.440  00:20:47
1  G07  1001     2017-10-16 06:11:04.600  2017-10-16 07:28:43.520  01:17:38.920000
2  G02  1001     2017-10-16 06:15:36.200  2017-10-16 06:23:36.200  00:08:00
3  G06  1001     2017-10-16 06:18:21.160  2017-10-16 06:23:36.120  00:05:14.960000
4  G03  1001     2017-10-16 06:29:20.640  2017-10-16 06:47:20.640  00:18:00
5  G05  1001     2017-10-16 06:29:41.640  2017-10-16 06:36:26.640  00:06:45
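For anyone who wants to reproduce this, the sample rows above can be rebuilt roughly as follows. This is only a sketch: it assumes Start and End are datetime64 columns and that Duration is simply End minus Start, which matches the values shown in the table.

```python
import pandas as pd

# Sample rows copied from the table above.
df1 = pd.DataFrame({
    'ID': ['G01', 'G07', 'G02', 'G06', 'G03', 'G05'],
    'EventID': [1001] * 6,
    'Start': pd.to_datetime([
        '2017-10-16 06:03:37.440', '2017-10-16 06:11:04.600',
        '2017-10-16 06:15:36.200', '2017-10-16 06:18:21.160',
        '2017-10-16 06:29:20.640', '2017-10-16 06:29:41.640',
    ]),
    'End': pd.to_datetime([
        '2017-10-16 06:24:24.440', '2017-10-16 07:28:43.520',
        '2017-10-16 06:23:36.200', '2017-10-16 06:23:36.120',
        '2017-10-16 06:47:20.640', '2017-10-16 06:36:26.640',
    ]),
})

# Assumption: Duration is derived from the two timestamps.
df1['Duration'] = df1['End'] - df1['Start']
```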
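I am trying to turn this into a time series in ten-minute increments, where each row records how long the event lasted during the preceding ten minutes (with a duration of zero when there was no event). The result I am expecting looks like this: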
   Start                ID   EventID  Duration
0  2017-10-16 06:10:00  G01  1001     00:06:22.560000
1  2017-10-16 06:20:00  G01  1001     00:10:00
2  2017-10-16 06:30:00  G01  1001     00:05:35.560000
3  2017-10-16 06:40:00  G01  1001     00:00:00
4  2017-10-16 06:50:00  G01  1001     00:00:00
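(If there is a good way to do this that only returns the increments in which an event was active, i.e. does not show the rows with a duration of 00:00:00, that would be fine too.)
Here is the code I have come up with so far (it creates a dataframe for each ID):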
df1.set_index(df1['Start'], inplace=True)
df1.rename(columns={'Start': 'Start_Time'}, inplace=True)
df1.index = df1.index.ceil('10min')
df2 = df1.where(df1['ID'] == 'G01').dropna()
df2 = df2.asfreq('10Min', method='pad').reset_index()
for row in df2.itertuples():
    ten_min = df2.Start[1] - df2.Start[0]
    zero_min = df2.Start[1] - df2.Start[1]
    if row.Start > row.End and row.Start > row.Start_Time:
        if (row.Start - row.End) < ten_min:
            df2.loc[row.Index, 'Duration'] = row.Start - row.End
        else:
            df2.loc[row.Index, 'Duration'] = zero_min
    if row.Start < row.End:
        if (row.Start - row.Start_Time) < ten_min:
            df2.loc[row.Index, 'Duration'] = row.Start - row.Start_Time
        else:
            df2.loc[row.Index, 'Duration'] = ten_min
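Problems:
- This works for my first ID, but G02 is already a problem because it has several events that start and end within the same ten minutes, and my asfreq() approach does not work with a non-unique index.
- The computation takes a long time. I am looking for suggestions to improve performance.
Any feedback is welcome!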
Here is a solution that feels a bit clunky, but it may run fast enough for your needs.
import pandas as pd

def get_durations(df_subset):
    '''A helper function to be passed to df.groupby(...).apply().'''
    # If each ID only has 1 row in the input DataFrame, then
    # any of .min(), .max(), or .iloc[0] should work here
    t1 = df_subset['Start'].min()
    t2 = df_subset['End'].max()
    # Build a DatetimeIndex whose start and end values are the next
    # available 10-minute tickmarks after t1 and t2
    idx = pd.date_range(t1.ceil('10min'), t2.ceil('10min'), freq='10min')
    # Calculate 10-minute durations with .diff(). Note that
    # idx.to_series() returns a series whose values are all equal to the
    # corresponding values of its own DatetimeIndex. So dur.index is
    # what we will call 'Start' and dur.values will be 'Duration'
    dur = idx.to_series().diff()
    # Manually adjust the first and last durations. .iloc is used because
    # the index holds timestamps, so the access has to be positional.
    # Note that idx[-1] - t2 is the time between the end of the event
    # and the next tickmark.
    dur.iloc[0] = idx[0] - t1
    dur.iloc[-1] = idx[-1] - t2
    dur.index.rename('Start', inplace=True)
    return dur

# Apply the above function to each ID in the input DataFrame
df1.groupby(['ID', 'EventID']).apply(get_durations).rename('Duration').to_frame().reset_index()
# Output:
    ID   EventID  Start                Duration
0   G01  1001     2017-10-16 06:10:00  00:06:22.560000
1   G01  1001     2017-10-16 06:20:00  00:10:00
2   G01  1001     2017-10-16 06:30:00  00:05:35.560000
3   G02  1001     2017-10-16 06:20:00  00:04:23.800000
4   G02  1001     2017-10-16 06:30:00  00:06:23.800000
5   G03  1001     2017-10-16 06:30:00  00:00:39.360000
6   G03  1001     2017-10-16 06:40:00  00:10:00
7   G03  1001     2017-10-16 06:50:00  00:02:39.360000
8   G05  1001     2017-10-16 06:30:00  00:00:18.360000
9   G05  1001     2017-10-16 06:40:00  00:03:33.360000
10  G06  1001     2017-10-16 06:20:00  00:01:38.840000
11  G06  1001     2017-10-16 06:30:00  00:06:23.880000
12  G07  1001     2017-10-16 06:20:00  00:08:55.400000
13  G07  1001     2017-10-16 06:30:00  00:10:00
14  G07  1001     2017-10-16 06:40:00  00:10:00
15  G07  1001     2017-10-16 06:50:00  00:10:00
16  G07  1001     2017-10-16 07:00:00  00:10:00
17  G07  1001     2017-10-16 07:10:00  00:10:00
18  G07  1001     2017-10-16 07:20:00  00:10:00
19  G07  1001     2017-10-16 07:30:00  00:01:16.480000
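One way to sanity-check the numbers in a table like this is to compute, for every ten-minute window, the overlap between the window and the event: min(window end, event end) minus max(window start, event start). The sketch below does that for a single event. The function name overlap_per_bin and the variables bin_start and bin_end are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def overlap_per_bin(start, end, freq='10min'):
    """Time that the event [start, end] spends inside each window.

    Windows are labelled by their end tick mark, as in the tables above.
    """
    ticks = pd.date_range(start.ceil(freq), end.ceil(freq), freq=freq)
    bin_end = ticks.to_series()
    bin_start = bin_end - pd.Timedelta(freq)
    # Element-wise max(bin_start, start) and min(bin_end, end)
    lo = bin_start.where(bin_start > start, start)
    hi = bin_end.where(bin_end < end, end)
    out = (hi - lo).rename('Duration')
    out.index.name = 'Start'
    return out

# The G01 event from the question
g01 = overlap_per_bin(pd.Timestamp('2017-10-16 06:03:37.440'),
                      pd.Timestamp('2017-10-16 06:24:24.440'))
```

For G01 this yields 00:06:22.560000, 00:10:00 and 00:04:24.440000, which add up to the event's total duration of 00:20:47. The 00:05:35.560000 shown for the 06:30 row above is the time remaining from the end of the event until the next tick mark, so the last row of each ID may need adjusting depending on which of the two quantities is wanted.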
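Peter's answer helped me a lot. I had to add a few workarounds, because my dataset does contain multiple instances of each ID, plus some events that start and end within the same ten minutes. I also created a new table in which all the IDs sit next to each other. Here is the code (it runs in 1.6 seconds):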
def get_durations(x):
    # Keep only the events that belong to ID x
    df2 = df1.where(df1['ID'] == x).dropna()
    duration = []
    index_dur = []
    for row in df2.itertuples():
        t1 = row.Start
        t2 = row.End
        idx = pd.date_range(t1.ceil('10min'), t2.ceil('10min'), freq='10min')
        dur = idx.to_series().diff()
        if len(dur) > 1:
            # The event spans several windows: adjust the first and last one
            dur.iloc[0] = idx[0] - t1
            dur.iloc[-1] = t2 - idx[-2]
        else:
            # The event starts and ends within the same ten-minute window
            dur.iloc[0] = t2 - t1
        dur.index.rename('Start', inplace=True)
        duration.extend(dur)
        index_dur.extend(dur.index)
    df3 = pd.DataFrame(duration, index_dur).reset_index()
    df3.columns = ['Time', x]
    # Add up the durations of events that fall into the same window
    return pd.DataFrame(df3.groupby('Time')[x].sum()).reset_index()

IDs_list = df1['ID'].unique().tolist()
G01 = get_durations(IDs_list[0])
G07 = get_durations(IDs_list[1])
G02 = get_durations(IDs_list[2])
G06 = get_durations(IDs_list[3])
G03 = get_durations(IDs_list[4])
G05 = get_durations(IDs_list[5])
G04 = get_durations(IDs_list[6])
df4 = G01.merge(G02, how='outer', on='Time').merge(G03, how='outer', on='Time')
The output is:
   Time                 G01              G02              G03
0  2017-10-16 06:10:00  00:06:22.560000  NaT              NaT
1  2017-10-16 06:20:00  00:10:00         00:04:23.800000  NaT
2  2017-10-16 06:30:00  00:04:24.440000  00:03:38         00:00:39.360000
3  2017-10-20 22:00:00  00:06:13.040000  NaT              NaT
4  2017-10-21 12:30:00  00:05:17.960000  NaT              NaT
5  2017-12-13 15:50:00  00:00:14.480000  NaT              NaT
6  2017-12-13 16:00:00  00:02:57.520000  NaT              NaT
7  2017-12-29 06:00:00  00:05:18         00:04:16.960000  00:04:48
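There is probably still room for improvement. I am not very good with the apply() method, and I am sure something could be done with it at this point. I am still open to suggestions for improvement; otherwise, this works :)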
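One possible direction, offered only as a sketch and not timed against the real dataset: instead of building one frame per ID and merging them by hand, expand every event into one row per ten-minute window it touches, compute the overlap for all rows at once, then sum per (Time, ID) and unstack the IDs into columns. The function name durations_wide and the argument name events are made up for this example. It assumes the input has ID, Start and End columns with datetime64 values.

```python
import pandas as pd

def durations_wide(events, freq='10min'):
    """Per-window event durations, one column per ID (hypothetical helper)."""
    step = pd.Timedelta(freq)
    ev = events[['ID', 'Start', 'End']].reset_index(drop=True)
    # For each event, list the end tick marks of all windows it touches
    ev['Time'] = pd.Series(
        [pd.date_range(s.ceil(freq), e.ceil(freq), freq=freq).tolist()
         for s, e in zip(ev['Start'], ev['End'])],
        index=ev.index,
    )
    # One row per (event, window)
    ev = ev.explode('Time').reset_index(drop=True)
    ev['Time'] = pd.to_datetime(ev['Time'])
    bin_start = ev['Time'] - step
    # Overlap of the event with its window: min(ends) - max(starts)
    lo = ev['Start'].where(ev['Start'] > bin_start, bin_start)
    hi = ev['End'].where(ev['End'] < ev['Time'], ev['Time'])
    ev['Duration'] = hi - lo
    # Events of the same ID that share a window are added together
    return ev.groupby(['Time', 'ID'])['Duration'].sum().unstack('ID')
```

Calling durations_wide(df1) should give a table indexed by Time with one column per ID and NaT where an ID had no event in that window, similar in shape to the merged df4 above. Because the sum is taken per window and ID, several events that start and end within the same ten minutes do not need special handling.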