Pandas dataframe calculations between rows
I am trying to read a log and compute the duration of a certain workflow. The dataframe containing the log looks like this:
Timestamp Workflow Status
20:31:52 ABC Started
...
...
20:32:50 ABC Completed
To compute the duration, I am using the following code:
start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time
The answer I get is:
1 NaT
72 NaT
Name: Timestamp, dtype: timedelta64[ns]
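I think the time difference is not computed correctly because the indexes are different. Of course, I can get the right answer by explicitly using the index of each row: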
duration = compl_time.loc[72] - start_time[1]
But this seems like an inelegant way of doing things. Is there a better way to accomplish the same thing?
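You are right, the different indexes are the problem: the two Series cannot be aligned, so every row of the result is NaT.
The simplest fix is to convert one side to a numpy array with values, which makes the subtraction positional. Both Series need to have the same length for this (here both have length 1). For selecting with boolean indexing it is also better to use loc: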
mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']
print (len(start_time))
1
print (len(compl_time))
1
duration = compl_time - start_time.values
print (duration)
1 00:00:58
Name: Timestamp, dtype: timedelta64[ns]
duration = compl_time.values - start_time.values
print (pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)
print (pd.Series(pd.to_timedelta(duration)))
0 00:00:58
dtype: timedelta64[ns]
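To make the alignment behaviour easy to reproduce, here is a minimal self-contained sketch. The sample data is an assumption: the date is made up (the question only shows times of day), and the index labels 1 and 72 are taken from the NaT output in the question. It uses Series.to_numpy(), which gives the same result as values here, and also shows a scalar subtraction with iloc for the case where exactly one start row and one completion row exist.

```python
import pandas as pd

# Hypothetical log data mirroring the question; the date is invented and
# the index labels 1 and 72 match the row labels in the NaT output.
log_text = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2024-01-01 20:31:52", "2024-01-01 20:32:50"]),
    "Workflow": ["ABC", "ABC"],
    "Status": ["Started", "Completed"],
})
log_text.index = [1, 72]

mask = log_text["Workflow"] == "ABC"
start_time = log_text.loc[mask & (log_text["Status"] == "Started"), "Timestamp"]
compl_time = log_text.loc[mask & (log_text["Status"] == "Completed"), "Timestamp"]

# Series - Series aligns on index labels; 1 and 72 never match,
# so the result has both labels and every value is NaT.
misaligned = compl_time - start_time

# Converting one side to a numpy array drops its index, so the
# subtraction is positional and keeps the index of compl_time.
duration = compl_time - start_time.to_numpy()

# With exactly one start and one completion row, a scalar Timedelta
# can be obtained by position instead of by label.
elapsed = compl_time.iloc[0] - start_time.iloc[0]

print(misaligned)
print(duration)
print(elapsed)
```

Unlike compl_time.loc[72] - start_time[1], the iloc version does not depend on knowing the row labels in advance, but it still assumes a single matching row on each side.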