Compute duration for multiple scenarios in pandas
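I have a DataFrame containing multiple IDs. I want to slice it with a sliding window and compute the duration of each ID that appears in the window. Some time slices contain only one id, while others contain several.
When multiple IDs appear, I can capture each ID's duration as follows.
DataFrame with multiple IDs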
id,date,value
1,2012-01-01 00:09:45,1
1,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
3,2012-01-01 00:30:20,1
3,2012-01-01 00:30:25,1
3,2012-01-01 00:30:30,1
1,2012-01-01 00:30:45,1
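import pandas as pd
df = pd.read_csv('df.csv')
df['date'] = pd.to_datetime(df['date'])
# True on the first row of each run of consecutive identical ids
diff_ids = df['id'] != df['id'].shift(1)
df = df[diff_ids].copy()
# Each run starts at its first timestamp and ends where the next run starts
df['start'] = df['date']
df['end'] = df['date'].shift(-1)
# The last run has no following run, so its end and duration are NaT
df['duration'] = df['end'] - df['start']
print(df)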
Output
id date value start end duration
1 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:09:55 00:00:10
2 2012-01-01 00:09:55 1 2012-01-01 00:09:55 2012-01-01 00:30:20 00:20:25
3 2012-01-01 00:30:20 1 2012-01-01 00:30:20 2012-01-01 00:30:45 00:00:25
1 2012-01-01 00:30:45 1 2012-01-01 00:30:45 NaT NaT
Following the same logic as above, how can the case below, where only one id appears, be handled as well?
DataFrame with a single id
id,date,value
2,2012-01-01 00:09:45,1
2,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
2,2012-01-01 00:30:20,1
2,2012-01-01 00:30:25,1
2,2012-01-01 00:30:30,1
2,2012-01-01 00:30:45,1
Expected output:
id date value start end duration
2 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:30:45 00:21:00
If there is only one ID, you can do this:
>>> df.sort_values("date").head(1).assign(start=df["date"].min(), end=df["date"].max(), duration=df["date"].max() - df["date"].min())
id date value start end duration
2 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:30:45 0 days 00:21:00
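If you would rather not branch on the number of IDs, here is a minimal sketch that covers both cases with the question's own run-detection logic. The function name `run_durations` is hypothetical, and it rests on one assumption that is not in the question: the last run in a window ends at the window's last timestamp instead of being left as NaT. With a single id this reduces to max minus min, which is what the one-liner above computes.

```python
import io
import pandas as pd

def run_durations(df: pd.DataFrame) -> pd.DataFrame:
    """One row per run of consecutive identical ids, with start, end and duration."""
    df = df.sort_values("date", kind="stable").reset_index(drop=True)
    # Keep only the first row of every run of identical consecutive ids.
    first_of_run = df["id"] != df["id"].shift(1)
    out = df[first_of_run].copy()
    out["start"] = out["date"]
    # A run ends where the next run starts. The last run has no successor, so
    # (assumption) it is closed at the last timestamp in the window.
    out["end"] = out["date"].shift(-1).fillna(df["date"].max())
    out["duration"] = out["end"] - out["start"]
    return out

# Sample data copied from the question.
multi_csv = """id,date,value
1,2012-01-01 00:09:45,1
1,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
3,2012-01-01 00:30:20,1
3,2012-01-01 00:30:25,1
3,2012-01-01 00:30:30,1
1,2012-01-01 00:30:45,1
"""
single_csv = """id,date,value
2,2012-01-01 00:09:45,1
2,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
2,2012-01-01 00:30:20,1
2,2012-01-01 00:30:25,1
2,2012-01-01 00:30:30,1
2,2012-01-01 00:30:45,1
"""

multi_result = run_durations(pd.read_csv(io.StringIO(multi_csv), parse_dates=["date"]))
single_result = run_durations(pd.read_csv(io.StringIO(single_csv), parse_dates=["date"]))
print(multi_result)
print(single_result)
```

Note that under this assumption the final row of the multi-ID case gets a duration of zero where the question's code gives NaT. Drop the `fillna` call if you prefer to keep the NaT when more than one run is present.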