Create a dataframe containing time spent in location based on a dataframe with trip information in Python
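I have a dataframe df that records people's movements from one location to another. Based on the trip information in df, I want to create a new dataframe df2 that records, for each location a person arrives at, the time they entered it and the time they left it.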
import pandas as pd
import numpy as np
current_time = '2022-05-05 17:00'
df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "start_location_id": [1, 5, 2, 7, 3, 8],
                   "end_location_id": [2, 6, 3, 8, 2, 9],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
                   })
df looks like this:
print(df.to_string())
>>
  person_id  start_location_id  end_location_id            start_ts              end_ts
0         A                  1                2 2022-05-05 00:00:00 2022-05-05 02:00:00
1         B                  5                6 2022-05-05 00:00:00 2022-05-05 03:00:00
2         A                  2                3 2022-05-05 05:00:00 2022-05-05 10:00:00
3         C                  7                8 2022-05-05 00:00:00 2022-05-05 04:00:00
4         A                  3                2 2022-05-05 13:00:00 2022-05-05 16:00:00
5         C                  8                9 2022-05-05 11:00:00 2022-05-05 12:00:00
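I then create df2 with the nested for loop below. I would like to replace the code that builds df2 with code that uses pandas' groupby (or an equivalent) function. My current attempt is as follows: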
# Note: DataFrame.append was removed in pandas 2.0, so this loop needs pandas < 2.0.
df2 = pd.DataFrame(columns=["person_id", "location_id", "enter_timestamp", "exit_timestamp"])
for i in np.unique(df["person_id"]):
    # All trips of person i, re-indexed 0..n-1 in their original order
    df3 = df.loc[df["person_id"] == i].reset_index(drop=True)
    for j in df3.index:
        try:
            # The stay ends when the person's next trip starts
            df2 = df2.append({'person_id': i,
                              'location_id': df3.loc[j, "end_location_id"],
                              'enter_timestamp': df3.loc[j, "end_ts"],
                              'exit_timestamp': df3.loc[j + 1, "start_ts"]}, ignore_index=True)
        except KeyError:
            # Last trip of this person: there is no next trip, so no exit time
            df2 = df2.append({'person_id': i,
                              'location_id': df3.loc[j, "end_location_id"],
                              'enter_timestamp': df3.loc[j, "end_ts"]}, ignore_index=True)
My desired output is:
print(df2.to_string())
>>
  person_id  location_id     enter_timestamp      exit_timestamp
0         A            2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A            3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A            2 2022-05-05 16:00:00                 NaT
3         B            6 2022-05-05 03:00:00                 NaT
4         C            8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C            9 2022-05-05 12:00:00                 NaT
How can I achieve this? Thanks.
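First sort the rows by person_id and start_ts with DataFrame.sort_values, so that each person's trips are in chronological order. Then create the exit_timestamp column with DataFrameGroupBy.shift, rename the columns with rename, and finally select the expected columns using the list cols:
cols = ["person_id", "location_id", "enter_timestamp", "exit_timestamp"]
# Sort by person and trip start so that shift(-1) picks up each person's next trip
df = df.sort_values(['person_id', 'start_ts'], ignore_index=True)
# Exit time = start of the same person's next trip (NaT after the last trip)
df['exit_timestamp'] = df.groupby('person_id')['start_ts'].shift(-1)
df = df.rename(columns={'end_location_id': 'location_id', 'end_ts': 'enter_timestamp'})[cols]
print(df)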
  person_id  location_id     enter_timestamp      exit_timestamp
0         A            2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A            3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A            2 2022-05-05 16:00:00                 NaT
3         B            6 2022-05-05 03:00:00                 NaT
4         C            8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C            9 2022-05-05 12:00:00                 NaT
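The question defines current_time but never uses it, and the title asks for the time spent in each location. One possible extension is sketched below. It rests on assumptions not stated in the thread: a stay with no later trip (exit_timestamp is NaT) is counted up to current_time, and trips are ordered by start_ts within each person. The helper name build_stays and the column time_spent are hypothetical, chosen only for illustration. Unlike the answer's code, this version leaves df unchanged.

```python
import pandas as pd

# Same trip data as in the question.
df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                              "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

# Assumption: stays that have not ended yet are measured up to this moment.
current_time = pd.Timestamp("2022-05-05 17:00")


def build_stays(trips, now):
    """Hypothetical helper: one row per stay, plus the time spent at the location."""
    # Order each person's trips chronologically; the input frame is not modified.
    trips = trips.sort_values(["person_id", "start_ts"], ignore_index=True)
    stays = pd.DataFrame({
        "person_id": trips["person_id"],
        "location_id": trips["end_location_id"],
        "enter_timestamp": trips["end_ts"],
        # Exit time = start of the same person's next trip (NaT after the last trip).
        "exit_timestamp": trips.groupby("person_id")["start_ts"].shift(-1),
    })
    # Open-ended stays are closed at `now` only for the duration calculation;
    # exit_timestamp itself keeps its NaT.
    stays["time_spent"] = stays["exit_timestamp"].fillna(now) - stays["enter_timestamp"]
    return stays


df2 = build_stays(df, current_time)
print(df2.to_string())
```

With the sample data, person B's open stay at location 6 comes out as 14 hours under this assumption (03:00 to 17:00). If open stays should not be counted at all, dropping the fillna call leaves time_spent as NaT for those rows.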