Pandas

Question

I have a DataFrame df1 of events, each with a start and an end date:

    ID  EventID Start                    End                        Duration
0   G01 1001    2017-10-16 06:03:37.440  2017-10-16 06:24:24.440    00:20:47
1   G07 1001    2017-10-16 06:11:04.600  2017-10-16 07:28:43.520    01:17:38.920000
2   G02 1001    2017-10-16 06:15:36.200  2017-10-16 06:23:36.200    00:08:00
3   G06 1001    2017-10-16 06:18:21.160  2017-10-16 06:23:36.120    00:05:14.960000
4   G03 1001    2017-10-16 06:29:20.640  2017-10-16 06:47:20.640    00:18:00
5   G05 1001    2017-10-16 06:29:41.640  2017-10-16 06:36:26.640    00:06:45
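
For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch: it assumes that Start and End are datetime64 columns and that Duration is simply End - Start, which matches the values shown in the table.

```python
import pandas as pd

# Sample rows copied from the table above
df1 = pd.DataFrame({
    "ID": ["G01", "G07", "G02", "G06", "G03", "G05"],
    "EventID": [1001] * 6,
    "Start": pd.to_datetime([
        "2017-10-16 06:03:37.440",
        "2017-10-16 06:11:04.600",
        "2017-10-16 06:15:36.200",
        "2017-10-16 06:18:21.160",
        "2017-10-16 06:29:20.640",
        "2017-10-16 06:29:41.640",
    ]),
    "End": pd.to_datetime([
        "2017-10-16 06:24:24.440",
        "2017-10-16 07:28:43.520",
        "2017-10-16 06:23:36.200",
        "2017-10-16 06:23:36.120",
        "2017-10-16 06:47:20.640",
        "2017-10-16 06:36:26.640",
    ]),
})

# Assumption: Duration is derived from the two timestamps
df1["Duration"] = df1["End"] - df1["Start"]
print(df1)
```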

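I am trying to turn this into a time series in ten-minute increments, where each row records how long the event lasted during the preceding ten minutes (with a duration of zero when there was no event). The result I am hoping for looks like this: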

    Start                ID     EventID  Duration
0   2017-10-16 06:10:00  G01    1001     0:06:22.560000
1   2017-10-16 06:20:00  G01    1001     00:10:00
2   2017-10-16 06:30:00  G01    1001     00:05:35.560000
3   2017-10-16 06:40:00  G01    1001     00:00:00
4   2017-10-16 06:50:00  G01    1001     00:00:00

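(If there is a neat way to do this that only returns the increments with relevant events, i.e. without the rows where the duration is 00:00:00, that would be fine too.)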

This is the code I have produced so far (it creates one DataFrame per ID):

df1.set_index(df1['Start'], inplace = True)
df1.rename(columns={'Start':'Start_Time'}, inplace=True)
df1.index = df1.index.ceil('10min')

df2 = df1.where(df1['ID'] == 'G01').dropna()
df2 = df2.asfreq('10Min', method = 'pad').reset_index()     
for row in df2.itertuples():
    ten_min = df2.Start[1]-df2.Start[0]
    zero_min = df2.Start[1]-df2.Start[1]
    if row.Start > row.End and row.Start > row.Start_Time:
        if (row.Start - row.End) < ten_min:
            df2.loc[row.Index, 'Duration'] = row.Start - row.End
        else:
            df2.loc[row.Index, 'Duration'] = zero_min
    if row.Start < row.End:
        if (row.Start - row.Start_Time) < ten_min:
            df2.loc[row.Index, 'Duration'] = row.Start - row.Start_Time
        else:
            df2.loc[row.Index, 'Duration'] = ten_min

Problems:

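This works for my first ID, but G02 is already a problem because it has several events that start and end within the same ten-minute window, and my asfreq() approach does not work on a non-unique index.
The computation takes a long time, so I am looking for suggestions to improve performance.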
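
The non-unique index problem can be reproduced in isolation. The snippet below is a minimal sketch using two start times from the sample that both round up to the same tick; the integer values and the groupby-sum workaround are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Two start times from the sample that both round up to the 06:20 tick
idx = pd.to_datetime(
    ["2017-10-16 06:15:36.200", "2017-10-16 06:18:21.160"]
).ceil("10min")
s = pd.Series([1, 2], index=idx)  # placeholder values

print(s.index.is_unique)

try:
    s.asfreq("10min", method="pad")
    raised = False
except ValueError as exc:
    # asfreq reindexes internally, which needs unique labels
    raised = True
    print("asfreq failed:", exc)

# One possible workaround: collapse duplicate ticks before calling asfreq
s_unique = s.groupby(level=0).sum()
resampled = s_unique.asfreq("10min", method="pad")
print(resampled)
```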

Any feedback is welcome!

Answer 1

Here is a solution that feels a bit clunky, but it may run fast enough for your needs.

import pandas as pd

def get_durations(df_subset):
    '''A helper function to be passed to df.apply().'''
    # If each ID only has 1 row in the input DataFrame, then 
    # any of .min(), .max(), or .iloc[0] should work here
    t1 = df_subset['Start'].min()
    t2 = df_subset['End'].max()

    # Build a DatetimeIndex whose start and end values are the next 
    # available 10-minute tickmarks after t1 and t2
    idx = pd.date_range(t1.ceil('10min'), t2.ceil('10min'), freq='10min')

    # Calculate 10-minute durations with .diff(). Note that
    # idx.to_series() returns a series whose values are all equal to the 
    # corresponding values of its own DatetimeIndex. So dur.index is
    # what we will call 'Start' and dur.values will be 'Duration'
    dur = idx.to_series().diff()

    # Manually adjust the first and last durations (positional access via
    # .iloc, since dur is indexed by timestamps rather than integers)
    dur.iloc[0] = idx[0] - t1
    dur.iloc[-1] = idx[-1] - t2

    dur.index.rename('Start', inplace=True)
    return dur


# Apply the above function to each (ID, EventID) group in the input DataFrame
df.groupby(['ID', 'EventID']).apply(get_durations).rename('Duration').to_frame().reset_index()

# Output:
     ID  EventID               Start        Duration
0   G01     1001 2017-10-16 06:10:00 00:06:22.560000
1   G01     1001 2017-10-16 06:20:00        00:10:00
2   G01     1001 2017-10-16 06:30:00 00:05:35.560000
3   G02     1001 2017-10-16 06:20:00 00:04:23.800000
4   G02     1001 2017-10-16 06:30:00 00:06:23.800000
5   G03     1001 2017-10-16 06:30:00 00:00:39.360000
6   G03     1001 2017-10-16 06:40:00        00:10:00
7   G03     1001 2017-10-16 06:50:00 00:02:39.360000
8   G05     1001 2017-10-16 06:30:00 00:00:18.360000
9   G05     1001 2017-10-16 06:40:00 00:03:33.360000
10  G06     1001 2017-10-16 06:20:00 00:01:38.840000
11  G06     1001 2017-10-16 06:30:00 00:06:23.880000
12  G07     1001 2017-10-16 06:20:00 00:08:55.400000
13  G07     1001 2017-10-16 06:30:00        00:10:00
14  G07     1001 2017-10-16 06:40:00        00:10:00
15  G07     1001 2017-10-16 06:50:00        00:10:00
16  G07     1001 2017-10-16 07:00:00        00:10:00
17  G07     1001 2017-10-16 07:10:00        00:10:00
18  G07     1001 2017-10-16 07:20:00        00:10:00
19  G07     1001 2017-10-16 07:30:00 00:01:16.480000
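
As an alternative that also copes with several events per ID and with events that start and end inside the same window, each event can be split across the ten-minute windows it touches and the overlaps summed per ID. The following is a sketch under stated assumptions, not a tuned implementation: the function name durations_per_window is made up for illustration, the input is assumed to have 'ID', 'Start' and 'End' columns of datetime64 dtype and at least one row, and each output row is labelled with the end of its window, as in the question's desired output.

```python
import pandas as pd

def durations_per_window(events, freq="10min"):
    """Sum, per ID and window, how long events overlapped that window."""
    width = pd.Timedelta(freq)
    pieces = []
    for row in events.itertuples(index=False):
        # Left edge of every window the event touches
        left = pd.Series(
            pd.date_range(row.Start.floor(freq), row.End.floor(freq), freq=freq)
        )
        right = left + width
        # Overlap of [Start, End] with each window [left, right]
        overlap = right.clip(upper=row.End) - left.clip(lower=row.Start)
        pieces.append(
            pd.DataFrame({"ID": row.ID, "Start": right, "Duration": overlap})
        )
    out = pd.concat(pieces, ignore_index=True)
    # Drop empty slices (e.g. when an event ends exactly on a tick)
    out = out[out["Duration"] > pd.Timedelta(0)]
    return out.groupby(["ID", "Start"], as_index=False)["Duration"].sum()

# Usage with two rows from the question's sample
events = pd.DataFrame({
    "ID": ["G01", "G02"],
    "Start": pd.to_datetime(["2017-10-16 06:03:37.440", "2017-10-16 06:15:36.200"]),
    "End": pd.to_datetime(["2017-10-16 06:24:24.440", "2017-10-16 06:23:36.200"]),
})
print(durations_per_window(events))
```

With this approach the durations for an ID always add up to the total event time, because the last window receives the time elapsed since the previous tick rather than the time remaining until the next one.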

Answer 2

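Peter's answer helped me a lot. I had to add a few workarounds, because my dataset does contain multiple instances of each ID, plus some events that start and end within the same ten-minute window. I also created a new table with all IDs side by side. Here is the code (it runs in 1.6 seconds):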

import pandas as pd

def get_durations(x):
    df2 = df1.where(df1['ID'] == x).dropna()
    duration = []
    index_dur = []
    for row in df2.itertuples():
        t1 = row.Start
        t2 = row.End
        idx = pd.date_range(t1.ceil('10min'), t2.ceil('10min'), freq='10min')
        dur = idx.to_series().diff()
        if len(dur) > 1:
            dur.iloc[0] = idx[0] - t1
            dur.iloc[-1] = t2 - idx[-2]
        else:
            # Event starts and ends within the same ten-minute window
            dur.iloc[0] = t2 - t1
        dur.index.rename('Start', inplace=True)
        duration.extend(dur)
        index_dur.extend(dur.index)

    # Build the frame once, after all events for this ID have been collected
    df3 = pd.DataFrame(duration, index_dur).reset_index()
    df3.columns = ['Time', x]
    return pd.DataFrame(df3.groupby('Time')[x].sum()).reset_index()

IDs_list = df1['ID'].unique().tolist()
G01 = get_durations(IDs_list[0])
G07 = get_durations(IDs_list[1])
G02 = get_durations(IDs_list[2])
G06 = get_durations(IDs_list[3])
G03 = get_durations(IDs_list[4])
G05 = get_durations(IDs_list[5])
G04 = get_durations(IDs_list[6])

df4 = G01.merge(G02, how='outer', on='Time').merge(G03, how='outer', on='Time')

The output is:

    Time                 G01               G02               G03
0   2017-10-16 06:10:00  00:06:22.560000   NaT               NaT    
1   2017-10-16 06:20:00  00:10:00          00:04:23.800000   NaT    
2   2017-10-16 06:30:00  00:04:24.440000   00:03:38          00:00:39.360000
3   2017-10-20 22:00:00  00:06:13.040000   NaT               NaT    
4   2017-10-21 12:30:00  00:05:17.960000   NaT               NaT    
5   2017-12-13 15:50:00  00:00:14.480000   NaT               NaT    
6   2017-12-13 16:00:00  00:02:57.520000   NaT               NaT    
7   2017-12-29 06:00:00  00:05:18          00:04:16.960000   00:04:48

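There is probably still room for improvement. I am not very good with the apply() method, and I am sure something could be done with it here. I am still open to suggestions; otherwise, this works :)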
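
One possible simplification of the last step: if the per-ID durations are kept in a single long table, the side-by-side layout can be produced with one pivot instead of a chain of merges. This is a sketch only; the long table below is hypothetical and just reuses the first three rows of the output above, with assumed column names Time, ID and Duration.

```python
import pandas as pd

# Hypothetical long-format table: one row per (Time, ID)
long = pd.DataFrame({
    "Time": pd.to_datetime([
        "2017-10-16 06:10:00", "2017-10-16 06:20:00", "2017-10-16 06:30:00",
        "2017-10-16 06:20:00", "2017-10-16 06:30:00", "2017-10-16 06:30:00",
    ]),
    "ID": ["G01", "G01", "G01", "G02", "G02", "G03"],
    "Duration": pd.to_timedelta([
        "00:06:22.56", "00:10:00", "00:04:24.44",
        "00:04:23.8", "00:03:38", "00:00:39.36",
    ]),
})

# One column per ID; windows without an event for an ID become NaT
wide = long.pivot(index="Time", columns="ID", values="Duration").reset_index()
wide.columns.name = None
print(wide)
```

Note that pivot requires each (Time, ID) pair to appear only once, so durations would need to be summed per window first, as get_durations already does.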

Pandas - Create a 10-min time series from dataframe of events with start and end date and time

time-series

date-range