Use a groupby instead of a for loop to determine time spent in last locations

Question

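I have a dataframe that records people's movements from one location to another. For each person_id, I want to calculate the time spent stationary at each end_location_id. For a given location, the duration is the difference between the start_ts of the person's next trip entry that departs from that location and the end_ts of the trip entry that arrived there. If no later trip entry is found for that person, the duration should be the difference between current_time and end_ts.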

I want to replace the code that calculates duration with code that uses pandas' groupby (or an equivalent) function. Is this possible? My current code is messy.

import pandas as pd

current_time = '2022-05-05 17:00'

df = pd.DataFrame(
{
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05  12:00"]),
}
)

for i in df.index:
    # later trips by the same person that depart from where trip i ended
    next_trip = (
        (df["person_id"] == df.loc[i, "person_id"])
        & (df["start_location_id"] == df.loc[i, "end_location_id"])
        & (df.index > i)
    )
    df.loc[i, "duration"] = (df.loc[next_trip, "start_ts"].min() - df.loc[i, "end_ts"]).total_seconds() / 3600

# no later trip found: measure up to current_time instead
no_next = df["duration"].isna()
df.loc[no_next, "duration"] = (pd.to_datetime(current_time) - df.loc[no_next, "end_ts"]).dt.total_seconds() / 3600

My desired output is:

print(df)
>>
  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0

Answer 1

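You can shift within each group (GroupBy.shift) to get each person's next start_ts, and fill the missing value left for the last trip with your current time. This assumes each person's rows are in chronological order and that every trip departs from where the previous one ended, as in your example: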

current_time = pd.Timestamp('2022-05-05 17:00') # works also with simple string

df['duration'] = (df
                  .groupby('person_id')['start_ts']
                  .shift(-1, fill_value=current_time)
                  .sub(df['end_ts'])
                  .dt.total_seconds().div(3600)
                 )

Output:

  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0
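
The shift-based approach ignores the location columns, so it is only equivalent to the original loop when each person's trips are chronological and chained (each trip departs from where the previous one ended). The following is a minimal sketch, not part of the original answer, that sorts a copy first and checks that chaining assumption before computing the duration. The names `trips`, `by_person`, `next_start`, `next_loc` and `chained` are introduced here purely for illustration.

```python
import pandas as pd

current_time = pd.Timestamp("2022-05-05 17:00")

df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(
            ["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
             "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]
        ),
        "end_ts": pd.to_datetime(
            ["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
             "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]
        ),
    }
)

# Assumption: rows may not arrive in chronological order, so sort a copy
# per person before shifting. The original index is kept for alignment.
trips = df.sort_values(["person_id", "start_ts"])
by_person = trips.groupby("person_id")

# Start time and departure location of each person's following trip
# (NaT / NaN for a person's last trip).
next_start = by_person["start_ts"].shift(-1)
next_loc = by_person["start_location_id"].shift(-1)

# Sanity check: wherever a following trip exists, it should depart from
# the location where the current trip ended.
chained = next_loc.isna() | next_loc.eq(trips["end_location_id"])
assert chained.all(), "some trips do not start where the previous one ended"

# Hours between arriving and departing again (or current_time if no later trip).
duration = (next_start.fillna(current_time) - trips["end_ts"]).dt.total_seconds().div(3600)

# Assignment aligns on the index, so the original row order of df is preserved.
df["duration"] = duration
print(df[["person_id", "end_location_id", "duration"]])
```

If the check fails on real data, the plain shift would silently measure the gap to an unrelated trip, and a location-aware match (as in the original loop) would be needed instead.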
