Create a dataframe containing time spent in location based on a dataframe with trip information in Python

Question

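I have a dataframe df that records people's movements from one location to another. I want to create a new dataframe df2 that records the time each person entered a location and the time they left that same location, based on the trip information in df.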

import pandas as pd
import numpy as np

current_time = '2022-05-05 17:00'

df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "start_location_id": [1, 5, 2, 7, 3, 8],
                   "end_location_id": [2, 6, 3, 8, 2, 9],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05  12:00"]),
})

df looks like this:

print(df.to_string())
>>
  person_id  start_location_id  end_location_id            start_ts              end_ts
0         A                  1                2 2022-05-05 00:00:00 2022-05-05 02:00:00
1         B                  5                6 2022-05-05 00:00:00 2022-05-05 03:00:00
2         A                  2                3 2022-05-05 05:00:00 2022-05-05 10:00:00
3         C                  7                8 2022-05-05 00:00:00 2022-05-05 04:00:00
4         A                  3                2 2022-05-05 13:00:00 2022-05-05 16:00:00
5         C                  8                9 2022-05-05 11:00:00 2022-05-05 12:00:00

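I then create df2 with the nested for loop below. I would like to replace the code that builds df2 with code that uses pandas' groupby (or an equivalent) function. My current attempt is as follows: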

rows = []

for i in np.unique(df["person_id"]):
    df3 = df.loc[df["person_id"] == i].reset_index(drop=True)
    for j in df3.index:
        # The exit time is the start of this person's next trip; NaT if there is none.
        exit_ts = df3.loc[j + 1, "start_ts"] if j + 1 in df3.index else pd.NaT
        # DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead.
        rows.append({"person_id": i,
                     "location_id": df3.loc[j, "end_location_id"],
                     "enter_timestamp": df3.loc[j, "end_ts"],
                     "exit_timestamp": exit_ts})

df2 = pd.DataFrame(rows, columns=["person_id", "location_id", "enter_timestamp", "exit_timestamp"])

My desired output is:

print(df2.to_string())
>>
  person_id location_id     enter_timestamp      exit_timestamp
0         A           2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A           3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A           2 2022-05-05 16:00:00                 NaT
3         B           6 2022-05-05 03:00:00                 NaT
4         C           8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C           9 2022-05-05 12:00:00                 NaT

How can I do this? Thanks.

Answer 1

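First sort the rows by the person_id column with DataFrame.sort_values, then create the exit_timestamp column with DataFrameGroupBy.shift, rename the columns with rename, and finally select the expected columns using the list cols. This assumes each person's trips already appear in chronological order in df, because shift(-1) takes the next row within each group: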

cols = ["person_id", "location_id", "enter_timestamp", "exit_timestamp"]
df = df.sort_values('person_id', ignore_index=True)
df['exit_timestamp'] = df.groupby('person_id')['start_ts'].shift(-1)
df = df.rename(columns={'end_location_id':'location_id', 'end_ts':'enter_timestamp'})[cols]
print(df)
  person_id  location_id     enter_timestamp      exit_timestamp
0         A            2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A            3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A            2 2022-05-05 16:00:00                 NaT
3         B            6 2022-05-05 03:00:00                 NaT
4         C            8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C            9 2022-05-05 12:00:00                 NaT
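
If the trips are not guaranteed to be in chronological order, or if you also want the time spent at each location (as the title suggests), the same idea can be sketched as below. This is only a sketch under some assumptions: build_stays and the time_spent column are hypothetical names not in the question, df is left unmodified, and an open-ended stay (NaT exit) is assumed to last until the current_time value from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                              "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})


def build_stays(trips, now=None):
    """Hypothetical helper: turn trip rows into one row per stay at a location."""
    # Sort by person and trip start so shift(-1) really sees the next trip in time.
    ordered = trips.sort_values(["person_id", "start_ts"], ignore_index=True)
    stays = pd.DataFrame({
        "person_id": ordered["person_id"],
        "location_id": ordered["end_location_id"],
        "enter_timestamp": ordered["end_ts"],
        # A stay ends when the same person's next trip starts; NaT if there is none.
        "exit_timestamp": ordered.groupby("person_id")["start_ts"].shift(-1),
    })
    if now is not None:
        # Assumption: a stay with no recorded exit is treated as lasting until `now`.
        effective_exit = stays["exit_timestamp"].fillna(pd.Timestamp(now))
        stays["time_spent"] = effective_exit - stays["enter_timestamp"]
    return stays


current_time = "2022-05-05 17:00"
df2 = build_stays(df, now=current_time)
print(df2.to_string())
```

Calling build_stays(df) without now returns just the four columns from the desired output; passing now adds time_spent and leaves exit_timestamp as NaT for stays that have not ended.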
