Propagate a calculation along pandas DataFrame rows
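I need to propagate a calculation (for example, a delay) along the rows of a pandas DataFrame.
I found a solution that uses the .iterrows() method, but it is very slow, so I am wondering whether there is a vectorized solution to this problem, since my data is huge.
Here is my approach: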
import pandas as pd
import numpy as np
df = pd.DataFrame(
    index=['task_1', 'task_2', 'task_3', 'task_4', 'task_5'],
    columns=['start_time', 'end_time'],
    data=[[1, 2], [3, 4], [6, 7], [7, 8], [10, 11]],
)
# set start delay on task 2
start_delay_on_task_2 = 3
df.loc['task_2', 'start_delay'] = start_delay_on_task_2
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['start_delay'] = df['start_delay'].fillna(0)
# compute buffer between tasks
df['buffer_to_next_task'] = df['start_time'].shift(-1) - df['end_time']
Here is the content of df:
        start_time  end_time
task_1           1         2
task_2           3         4
task_3           6         7
task_4           7         8
task_5          10        11
And now the ugly code that computes the overall delay:
df['overall_start_delay'] = df['start_delay']
overall_start_delay_idx = df.columns.get_loc('overall_start_delay')
start_delay_idx = df.columns.get_loc('start_delay')
buffer_to_next_task_idx = df.columns.get_loc('buffer_to_next_task')
for i in range(len(df)):
    overall_delay = None
    if df.index[i] <= 'task_2':
        # up to and including task_2, the overall delay is the task's own start delay
        overall_delay = df.iloc[i, start_delay_idx]
    else:
        # afterwards, carry over the previous delay minus the buffer that absorbed it
        overall_delay = max(0, df.iloc[i-1, overall_start_delay_idx] - df.iloc[i-1, buffer_to_next_task_idx])
    df.iloc[i, overall_start_delay_idx] = overall_delay
Here is the desired result:
        start_time  end_time  start_delay  buffer_to_next_task  overall_start_delay
task_1           1         2          0.0                  1.0                  0.0
task_2           3         4          3.0                  2.0                  3.0
task_3           6         7          0.0                  0.0                  1.0
task_4           7         8          0.0                  2.0                  1.0
task_5          10        11          0.0                  NaN                  0.0
Any suggestions on how to vectorize this code and avoid the for loop?
Here is a solution for the delay:
total_delays = df.start_delay.cumsum()
(total_delays
    .sub(df.buffer_to_next_task
           .where(total_delays.gt(0), 0)
           .cumsum().shift(fill_value=0)
        )
    .clip(lower=0)
)
Output:
task_1 0.0
task_2 3.0
task_3 1.0
task_4 1.0
task_5 0.0
dtype: float64
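As a sanity check, the sketch below rebuilds the sample frame and compares the cumulative-sum expression with a plain loop over NumPy arrays. The helper `overall_delay_loop` is a hypothetical name, not part of the original post, and it assumes the recurrence "own start delay plus whatever the previous buffer did not absorb", which matches the question's loop on this data.

```python
import numpy as np
import pandas as pd


def overall_delay_loop(start_delay, buffer_to_next):
    """Reference recurrence over plain arrays (hypothetical helper)."""
    delays = np.zeros(len(start_delay))
    carried = 0.0  # delay inherited from the previous task
    for i in range(len(start_delay)):
        delays[i] = start_delay[i] + carried
        # the buffer to the next task absorbs part (or all) of the delay
        carried = max(0.0, delays[i] - buffer_to_next[i])
    return delays


df = pd.DataFrame(
    index=['task_1', 'task_2', 'task_3', 'task_4', 'task_5'],
    columns=['start_time', 'end_time'],
    data=[[1, 2], [3, 4], [6, 7], [7, 8], [10, 11]],
)
df['start_delay'] = 0.0
df.loc['task_2', 'start_delay'] = 3
df['buffer_to_next_task'] = df['start_time'].shift(-1) - df['end_time']

# vectorized expression from the answer
total_delays = df['start_delay'].cumsum()
absorbed = (
    df['buffer_to_next_task']
    .where(total_delays.gt(0), 0)  # ignore buffers before any delay exists
    .cumsum()
    .shift(fill_value=0)           # a buffer only helps the tasks after it
)
vectorized = total_delays.sub(absorbed).clip(lower=0)

# loop version; the last buffer is NaN, so treat it as 0 (it is never used)
looped = overall_delay_loop(
    df['start_delay'].to_numpy(),
    df['buffer_to_next_task'].fillna(0).to_numpy(),
)

print(np.allclose(vectorized.to_numpy(), looped))
```

Note that the two versions agree here because only one task introduces a delay. Since the cumulative-sum expression clips only once at the end, it can undercount when a second delay appears after an earlier one has already been fully absorbed; for example, start delays of 3, 0, 2 with buffers of 5 and 0 give 3, 0, 2 from the loop but 3, 0, 0 from the vectorized form. In that case the loop over NumPy arrays (which is already much cheaper than repeated .iloc access) may be the safer choice.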