Pandas 根据某些条件计算时间增量

Pandas calculate time deltas based on some conditions

拥有以下 DF 的移动用户 activity:

    user_id                timestamp  wifi
0         1  2021-11-23 11:00:00.000     1
1         1  2021-11-23 11:01:00.000     1
2         1  2021-11-23 11:02:00.000     1
3         1  2021-11-23 11:10:00.000     1
4         1  2021-11-23 11:11:00.000     0
5         1  2021-11-23 11:22:00.000     0
6         2  2021-11-23 11:40:00.000     1
7         2  2021-11-23 11:41:00.000     1
8         2  2021-11-23 11:42:00.000     1
9         2  2021-11-23 11:43:00.000     0
10        2  2021-11-23 11:44:00.000     0
11        2  2021-11-23 11:48:00.000     0

user_id: 用户识别
timestamp: 日志时间
wifi:布尔 wifi 或蜂窝网络使用情况


我想计算 wifi 和蜂窝网络连接的时间使用情况,但有以下限制:

  1. 持续使用由相隔不到 5 分钟的两行定义。
  2. 没有连续事件的行将不被计算在内。

结果应该如下: 为简单起见,我在花费的时间列中填充了一个描述经过分钟数的数字。实际值应该是时间增量。

    user_id                timestamp  wifi  wifi_time_spent  cell_time_spent
0         1  2021-11-23 11:00:00.000     1                0                0
1         1  2021-11-23 11:01:00.000     1                1                0
2         1  2021-11-23 11:02:00.000     1                2                0
------------------------------ more then 5 min -----------------------------
3         1  2021-11-23 11:10:00.000     1                2                0 <-- single, not adding.
---------------- only 1 event before switching to cellular  ----------------
4         1  2021-11-23 11:11:00.000     0                2                0
------------------------------ more then 5 min -----------------------------
5         1  2021-11-23 11:22:00.000     0                2                0 <-- single, not adding.
---------------------------------- new user --------------------------------
6         2  2021-11-23 11:40:00.000     1                0                0
7         2  2021-11-23 11:41:00.000     1                1                0
8         2  2021-11-23 11:42:00.000     1                2                0
--------------------------- switching to cellular --------------------------
9         2  2021-11-23 11:43:00.000     0                2                0
10        2  2021-11-23 11:44:00.000     0                2                1
11        2  2021-11-23 11:48:00.000     0                2                5

我编写了以下代码,用唯一 ID 标记每 5 分钟的会话:

df['timestamp'] = pd.to_datetime(df.timestamp)
df['session_grp'] = df.groupby('user_id').apply(
    lambda x: (x.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])).ngroup()).reset_index(
    drop=True).values.reshape(-1)

它似乎工作正常:

    user_id           timestamp  wifi  session_grp
0         1 2021-11-23 11:00:00     1            0
1         1 2021-11-23 11:01:00     1            0
2         1 2021-11-23 11:02:00     1            0
3         1 2021-11-23 11:10:00     1            2
4         1 2021-11-23 11:11:00     0            2
5         1 2021-11-23 11:22:00     0            4
6         2 2021-11-23 11:40:00     1            0
7         2 2021-11-23 11:41:00     1            0
8         2 2021-11-23 11:42:00     1            0
9         2 2021-11-23 11:43:00     0            0
10        2 2021-11-23 11:44:00     0            0
11        2 2021-11-23 11:48:00     0            1

但就是这样,我被卡住了。任何帮助将不胜感激。

# convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# groupby 5 minute intervals
df['grp'] = df.groupby('user_id')['timestamp'].diff().dt.seconds.gt(300).cumsum()  
# calc the difference and fill misssing values with 0
df['diff'] = df.groupby(['user_id', 'wifi', 'grp'])['timestamp'].diff().fillna(pd.Timedelta(0))

# use loc to filter the frame and assign the diff value for each slice (i.e., wifi and cell)
df.loc[df['wifi'] == 1, 'wifi_time_spent'] = df.loc[df['wifi'] == 1, 'diff']
df.loc[df['wifi'] == 0, 'cell_time_spent'] = df.loc[df['wifi'] == 0, 'diff']
# drop columns not needed and fill the missing values with 0
df = df.drop(columns=['grp', 'diff']).fillna(pd.Timedelta(0))  

# groupby one more time and calculate the cumsum for each column
df['wifi_time_spent'] = df.groupby('user_id')['wifi_time_spent'].cumsum()
df['cell_time_spent'] = df.groupby('user_id')['cell_time_spent'].cumsum()

出来

    user_id           timestamp  wifi wifi_time_spent cell_time_spent
0         1 2021-11-23 11:00:00     1 0 days 00:00:00 0 days 00:00:00
1         1 2021-11-23 11:01:00     1 0 days 00:01:00 0 days 00:00:00
2         1 2021-11-23 11:02:00     1 0 days 00:02:00 0 days 00:00:00
3         1 2021-11-23 11:10:00     1 0 days 00:02:00 0 days 00:00:00
4         1 2021-11-23 11:11:00     0 0 days 00:02:00 0 days 00:00:00
5         1 2021-11-23 11:22:00     0 0 days 00:02:00 0 days 00:00:00
6         2 2021-11-23 11:40:00     1 0 days 00:00:00 0 days 00:00:00
7         2 2021-11-23 11:41:00     1 0 days 00:01:00 0 days 00:00:00
8         2 2021-11-23 11:42:00     1 0 days 00:02:00 0 days 00:00:00
9         2 2021-11-23 11:43:00     0 0 days 00:02:00 0 days 00:00:00
10        2 2021-11-23 11:44:00     0 0 days 00:02:00 0 days 00:01:00
11        2 2021-11-23 11:48:00     0 0 days 00:02:00 0 days 00:05:00