Pandas dataframe calculations between rows

I am trying to read a log and compute the duration of a certain workflow. The dataframe containing the log looks something like this:

Timestamp    Workflow    Status
20:31:52     ABC         Started
...
...
20:32:50     ABC         Completed

To compute the duration, I am using the following code:

start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time

The answer I get is:

1    NaT
72   NaT
Name: Timestamp, dtype: timedelta64[ns]

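I think the time difference is not computed correctly because the indexes differ. Of course, I can get the correct answer by explicitly using the index of each row: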

duration = compl_time.loc[72] - start_time[1]

But this seems like an inelegant way of doing things. Is there a better way to accomplish the same thing?

You are right: the two Series have different indexes, so the values cannot be aligned and the result contains only missing values (NaT).
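The alignment behaviour can be reproduced with a small sketch. The dates are made up for illustration, and the index labels 1 and 72 are chosen only to mirror the output shown in the question:

```python
import pandas as pd

# Hypothetical one-element Series with different index labels (1 and 72),
# mimicking the filtered start and completion rows from the question.
start_time = pd.Series(pd.to_datetime(["2020-01-01 20:31:52"]), index=[1])
compl_time = pd.Series(pd.to_datetime(["2020-01-01 20:32:50"]), index=[72])

# Subtraction aligns on index labels. No label occurs in both Series,
# so every label in the combined index ends up with a missing value (NaT).
aligned = compl_time - start_time
print(aligned)
```

Because pandas pairs values by label rather than by position, neither row finds a partner, which is why both entries come out as NaT.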

The simplest fix is to convert one side to a NumPy array with values, but both Series must have the same length (here both have length 1). For selecting with boolean indexing, it is better to use loc:

mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']

print (len(start_time))
1
print (len(compl_time))
1

duration = compl_time - start_time.values

print (duration)
1   00:00:58
Name: Timestamp, dtype: timedelta64[ns]

duration = compl_time.values - start_time.values

print (pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)

print (pd.Series(pd.to_timedelta(duration)))
0   00:00:58
dtype: timedelta64[ns]
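For reference, here is a self-contained sketch of the whole approach. The sample data is hypothetical (only the two ABC rows follow the question, the XYZ row is invented), and the reset_index variant is an alternative that the answer above does not use; it is included only because it sidesteps alignment in a similar way:

```python
import pandas as pd

# Hypothetical log data; only the two ABC rows are taken from the question.
log_text = pd.DataFrame({
    "Timestamp": pd.to_datetime(["20:31:52", "20:32:10", "20:32:50"],
                                format="%H:%M:%S"),
    "Workflow": ["ABC", "XYZ", "ABC"],
    "Status": ["Started", "Started", "Completed"],
})

mask = log_text["Workflow"] == "ABC"
start_time = log_text.loc[mask & (log_text["Status"] == "Started"), "Timestamp"]
compl_time = log_text.loc[mask & (log_text["Status"] == "Completed"), "Timestamp"]

# Option 1 (from the answer): drop the index on one side with .values,
# so the subtraction is done by position and keeps compl_time's index.
duration = compl_time - start_time.values

# Option 2 (an alternative, not from the answer): give both sides the same
# fresh 0..n-1 index so that label alignment pairs them by position.
duration_reset = (compl_time.reset_index(drop=True)
                  - start_time.reset_index(drop=True))

print(duration)
print(duration_reset)
```

Both options assume that the Started and Completed rows appear in matching order and in equal numbers; if a workflow can start without completing, the two Series would have different lengths and the subtraction via values would fail.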