Use a groupby instead of a for loop to determine time spent in last locations
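I have a dataframe that records people's movements from one location to another. I want to calculate how long each person_id stays put at each end_location_id. For a given location, the duration is the difference between the start_ts of the next trip entry and the end_ts of the previous trip entry. If no later trip entry is found for a person, the duration should be the difference between current_time and end_ts.
I would like to replace the code that computes duration with code that uses pandas' groupby (or an equivalent) function. Is that possible? My current code is quite messy.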
import pandas as pd

current_time = '2022-05-05 17:00'
df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
        "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
    }
)

for i in df.index:
    df.loc[i, "duration"] = (df.loc[(df["person_id"] == df.loc[i, "person_id"]) & (df["start_location_id"] == df.loc[i, "end_location_id"]) & (df.index > i), "start_ts"].min() - df.loc[i, "end_ts"]).total_seconds()/3600

df.loc[df["duration"].isna(), "duration"] = (pd.to_datetime(current_time) -
    df.loc[df["duration"].isna(), "end_ts"]).dt.total_seconds()/3600
The output I want is:
print(df)
>>
  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00
1         B                  5                6 2022-05-05 00:00:00
2         A                  2                3 2022-05-05 05:00:00
3         C                  7                8 2022-05-05 00:00:00
4         A                  3                2 2022-05-05 13:00:00
5         C                  8                9 2022-05-05 11:00:00

               end_ts  duration
0 2022-05-05 02:00:00       3.0
1 2022-05-05 03:00:00      14.0
2 2022-05-05 10:00:00       3.0
3 2022-05-05 04:00:00       7.0
4 2022-05-05 16:00:00       1.0
5 2022-05-05 12:00:00       5.0
You can shift within each group (GroupBy.shift) and fill the resulting missing values with your current time:
current_time = pd.Timestamp('2022-05-05 17:00') # works also with simple string
df['duration'] = (df
    .groupby('person_id')['start_ts']
    .shift(-1, fill_value=current_time)
    .sub(df['end_ts'])
    .dt.total_seconds().div(3600)
)
Output:
  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00
1         B                  5                6 2022-05-05 00:00:00
2         A                  2                3 2022-05-05 05:00:00
3         C                  7                8 2022-05-05 00:00:00
4         A                  3                2 2022-05-05 13:00:00
5         C                  8                9 2022-05-05 11:00:00

               end_ts  duration
0 2022-05-05 02:00:00       3.0
1 2022-05-05 03:00:00      14.0
2 2022-05-05 10:00:00       3.0
3 2022-05-05 04:00:00       7.0
4 2022-05-05 16:00:00       1.0
5 2022-05-05 12:00:00       5.0
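Note that shifting within each person_id group only looks at row order, not at location ids: it relies on each person's rows being in chronological order and on every trip starting where the previous one ended. The sample data satisfies both, but real data may not. The following is a minimal sketch, not part of the original answer, that sorts explicitly before shifting and checks the location assumption. The names ordered, next_start, next_location and continuous are illustrative choices.

```python
import pandas as pd

current_time = pd.Timestamp("2022-05-05 17:00")

df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                    "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
        "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                                  "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
    }
)

# Sort so that "next row in the group" really means "next trip in time".
ordered = df.sort_values(["person_id", "start_ts"])
grouped = ordered.groupby("person_id")

# Start time and start location of each person's next trip
# (NaT / NaN on a person's last trip).
next_start = grouped["start_ts"].shift(-1)
next_location = grouped["start_location_id"].shift(-1)

# Assumption check: wherever a next trip exists, it starts where this one ended.
has_next = next_start.notna()
continuous = (next_location[has_next] == ordered.loc[has_next, "end_location_id"]).all()

# Hours between the end of this trip and the next start (or current_time).
duration = (next_start.fillna(current_time) - ordered["end_ts"]).dt.total_seconds() / 3600

# Assignment aligns on the index, so the original row order of df is kept.
df["duration"] = duration
print(df[["person_id", "end_ts", "duration"]])
print("trips are location-continuous:", bool(continuous))
```

If continuous comes out False, the shift-based result and the original loop (which matches on start_location_id) can disagree, and the data would need to be inspected before trusting either.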