Use groupby instead of a for loop to determine time spent at the last location

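I have a dataframe that records people's trips from one location to another. For each person_id, I want to calculate how long the person stayed put at each end_location_id. The duration is the difference between the start_ts of that person's next trip, which starts from the same location_id, and the end_ts of the previous trip. If a person has no later trip, the duration should be current_time - end_ts.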
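
To make the rule concrete, here is a minimal sketch of the arithmetic for person A alone, using only the standard library. The names `trips_a` and `durations_a` are made up for illustration and are not part of the real code; the timestamps are person A's rows from the sample data further down.

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"
current_time = datetime.strptime("2022-05-05 17:00", fmt)

# Person A's trips in chronological order as (start_ts, end_ts) strings
raw_trips_a = [
    ("2022-05-05 00:00", "2022-05-05 02:00"),  # location 1 -> 2
    ("2022-05-05 05:00", "2022-05-05 10:00"),  # location 2 -> 3
    ("2022-05-05 13:00", "2022-05-05 16:00"),  # location 3 -> 2
]
trips_a = [
    (datetime.strptime(start, fmt), datetime.strptime(end, fmt))
    for start, end in raw_trips_a
]

durations_a = []
for i, (_, end_ts) in enumerate(trips_a):
    # Use the next trip's start if there is one, otherwise the current time
    next_start = trips_a[i + 1][0] if i + 1 < len(trips_a) else current_time
    durations_a.append((next_start - end_ts).total_seconds() / 3600)

print(durations_a)  # [3.0, 3.0, 1.0]
```

The last value is measured against current_time because person A has no trip after the one ending at 16:00.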


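I would like to replace the code that calculates duration with code that uses pandas' groupby (or an equivalent) function. Is this possible? My current code is messy.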

import pandas as pd

current_time = '2022-05-05 17:00'

df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

for i in df.index:
    # Later trips by the same person that start where this trip ended
    next_trip = (
        (df["person_id"] == df.loc[i, "person_id"])
        & (df["start_location_id"] == df.loc[i, "end_location_id"])
        & (df.index > i)
    )
    df.loc[i, "duration"] = (df.loc[next_trip, "start_ts"].min() - df.loc[i, "end_ts"]).total_seconds() / 3600

# No later trip found: measure up to current_time instead
no_next = df["duration"].isna()
df.loc[no_next, "duration"] = (pd.to_datetime(current_time) - df.loc[no_next, "end_ts"]).dt.total_seconds() / 3600


  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0  

You can shift within each group (GroupBy.shift) and fill the resulting NaN with your current time:

current_time = pd.Timestamp('2022-05-05 17:00') # works also with simple string

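df['duration'] = (df.groupby('person_id')['start_ts']
                  .shift(-1, fill_value=current_time)  # next trip's start_ts per person
                  .sub(df['end_ts']).dt.total_seconds().div(3600)
                 )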


  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0
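
One thing to keep in mind: shifting within each person pairs every trip with that person's next row, so it relies on the rows already being in chronological order, and it does not check that the next trip starts at the location where the previous one ended (the original loop did). In the sample data both hold. Below is a rough sketch of a variant that sorts first, so the row order no longer matters. The function name `add_stationary_duration` is made up for this example, and it assumes the index is unique so that the result can be aligned back to the original rows.

```python
import pandas as pd


def add_stationary_duration(trips: pd.DataFrame, current_time) -> pd.DataFrame:
    """Return a copy of `trips` with a 'duration' column in hours."""
    now = pd.Timestamp(current_time)
    # Sort so that "next row within the person" really is the next trip in time
    ordered = trips.sort_values(["person_id", "start_ts"])
    # Start of each person's next trip; the last trip per person gets `now`
    next_start = ordered.groupby("person_id")["start_ts"].shift(-1).fillna(now)
    out = trips.copy()
    # Index alignment puts each value back on its original row (needs a unique index)
    out["duration"] = (next_start - ordered["end_ts"]).dt.total_seconds() / 3600
    return out


trips = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                              "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

# Reverse the rows on purpose to show that the input order does not matter
result = add_stationary_duration(trips.iloc[::-1], "2022-05-05 17:00")
print(result.sort_index()["duration"].tolist())  # [3.0, 14.0, 3.0, 7.0, 1.0, 5.0]
```

If a person can start a trip from somewhere other than where the previous one ended, this pairing will not match the loop's results, and the location columns would have to be part of the matching.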