Pandas: calculate time deltas based on some conditions
I have the following DataFrame of mobile user activity:
user_id timestamp wifi
0 1 2021-11-23 11:00:00.000 1
1 1 2021-11-23 11:01:00.000 1
2 1 2021-11-23 11:02:00.000 1
3 1 2021-11-23 11:10:00.000 1
4 1 2021-11-23 11:11:00.000 0
5 1 2021-11-23 11:22:00.000 0
6 2 2021-11-23 11:40:00.000 1
7 2 2021-11-23 11:41:00.000 1
8 2 2021-11-23 11:42:00.000 1
9 2 2021-11-23 11:43:00.000 0
10 2 2021-11-23 11:44:00.000 0
11 2 2021-11-23 11:48:00.000 0
user_id: user identifier
timestamp: time of the log entry
wifi: boolean flag, 1 for wifi usage and 0 for cellular usage
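For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame. The values are copied from the table above; the name `df` is chosen only to match the code later in the post.

```python
import pandas as pd

# Sample data copied row by row from the table above.
df = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "timestamp": [
        "2021-11-23 11:00:00", "2021-11-23 11:01:00", "2021-11-23 11:02:00",
        "2021-11-23 11:10:00", "2021-11-23 11:11:00", "2021-11-23 11:22:00",
        "2021-11-23 11:40:00", "2021-11-23 11:41:00", "2021-11-23 11:42:00",
        "2021-11-23 11:43:00", "2021-11-23 11:44:00", "2021-11-23 11:48:00",
    ],
    "wifi": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})

# Parse the strings into real datetimes so that differences become timedeltas.
df["timestamp"] = pd.to_datetime(df["timestamp"])
```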
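I want to calculate how much time was spent on wifi and on cellular connections, with the following constraints:
- Continuous usage is defined by two rows that are less than 5 minutes apart.
- Rows without a consecutive event are not counted.
The result should look like this:
For simplicity, I filled the time-spent columns with a number giving the elapsed minutes. The actual values should be timedeltas.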
user_id timestamp wifi wifi_time_spent cell_time_spent
0 1 2021-11-23 11:00:00.000 1 0 0
1 1 2021-11-23 11:01:00.000 1 1 0
2 1 2021-11-23 11:02:00.000 1 2 0
------------------------------ more than 5 min -----------------------------
3 1 2021-11-23 11:10:00.000 1 2 0 <-- single, not adding.
---------------- only 1 event before switching to cellular ----------------
4 1 2021-11-23 11:11:00.000 0 2 0
------------------------------ more than 5 min -----------------------------
5 1 2021-11-23 11:22:00.000 0 2 0 <-- single, not adding.
---------------------------------- new user --------------------------------
6 2 2021-11-23 11:40:00.000 1 0 0
7 2 2021-11-23 11:41:00.000 1 1 0
8 2 2021-11-23 11:42:00.000 1 2 0
--------------------------- switching to cellular --------------------------
9 2 2021-11-23 11:43:00.000 0 2 0
10 2 2021-11-23 11:44:00.000 0 2 1
11 2 2021-11-23 11:48:00.000 0 2 5
I wrote the following code, which tags each 5-minute session with a unique ID:
df['timestamp'] = pd.to_datetime(df.timestamp)
df['session_grp'] = df.groupby('user_id').apply(
lambda x: (x.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])).ngroup()).reset_index(
drop=True).values.reshape(-1)
It seems to work fine:
user_id timestamp wifi session_grp
0 1 2021-11-23 11:00:00 1 0
1 1 2021-11-23 11:01:00 1 0
2 1 2021-11-23 11:02:00 1 0
3 1 2021-11-23 11:10:00 1 2
4 1 2021-11-23 11:11:00 0 2
5 1 2021-11-23 11:22:00 0 4
6 2 2021-11-23 11:40:00 1 0
7 2 2021-11-23 11:41:00 1 0
8 2 2021-11-23 11:42:00 1 0
9 2 2021-11-23 11:43:00 0 0
10 2 2021-11-23 11:44:00 0 0
11 2 2021-11-23 11:48:00 0 1
But that's it, I'm stuck. Any help would be greatly appreciated.
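One thing worth noting about the session_grp output above: `pd.Grouper` with `freq='5min'` puts rows into fixed 5-minute bins, so rows 10 and 11 (11:44 and 11:48, only 4 minutes apart) land in different groups. If a session is meant to break only when the gap to the previous row exceeds 5 minutes, a gap-based label may be closer to the intent. A rough sketch for user 2's timestamps; the names `ts`, `gap` and `session` are just illustrative:

```python
import pandas as pd

# User 2's timestamps from the table above.
ts = pd.Series(pd.to_datetime([
    "2021-11-23 11:40:00", "2021-11-23 11:41:00", "2021-11-23 11:42:00",
    "2021-11-23 11:43:00", "2021-11-23 11:44:00", "2021-11-23 11:48:00",
]))

# Gap to the previous row; the first row has no predecessor, so it is NaT.
gap = ts.diff()

# A new session starts whenever the gap exceeds the threshold.
# Comparing NaT gives False, so the first row stays in session 0.
session = (gap > pd.Timedelta(minutes=5)).cumsum()
```

With this rule all six rows end up in the same session, because no gap is larger than 5 minutes.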
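import pandas as pd

# convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
# start a new group whenever the gap to the same user's previous row exceeds 5 minutes
# (total_seconds() is used because .dt.seconds drops whole days from the gap)
df['grp'] = df.groupby('user_id')['timestamp'].diff().dt.total_seconds().gt(300).cumsum()
# calc the difference within each user/connection/group and fill missing values with 0
df['diff'] = df.groupby(['user_id', 'wifi', 'grp'])['timestamp'].diff().fillna(pd.Timedelta(0))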
# use loc to filter the frame and assign the diff value for each slice (i.e., wifi and cell)
df.loc[df['wifi'] == 1, 'wifi_time_spent'] = df.loc[df['wifi'] == 1, 'diff']
df.loc[df['wifi'] == 0, 'cell_time_spent'] = df.loc[df['wifi'] == 0, 'diff']
# drop columns not needed and fill the missing values with 0
df = df.drop(columns=['grp', 'diff']).fillna(pd.Timedelta(0))
# groupby one more time and calculate the cumsum for each column
df['wifi_time_spent'] = df.groupby('user_id')['wifi_time_spent'].cumsum()
df['cell_time_spent'] = df.groupby('user_id')['cell_time_spent'].cumsum()
Output:
user_id timestamp wifi wifi_time_spent cell_time_spent
0 1 2021-11-23 11:00:00 1 0 days 00:00:00 0 days 00:00:00
1 1 2021-11-23 11:01:00 1 0 days 00:01:00 0 days 00:00:00
2 1 2021-11-23 11:02:00 1 0 days 00:02:00 0 days 00:00:00
3 1 2021-11-23 11:10:00 1 0 days 00:02:00 0 days 00:00:00
4 1 2021-11-23 11:11:00 0 0 days 00:02:00 0 days 00:00:00
5 1 2021-11-23 11:22:00 0 0 days 00:02:00 0 days 00:00:00
6 2 2021-11-23 11:40:00 1 0 days 00:00:00 0 days 00:00:00
7 2 2021-11-23 11:41:00 1 0 days 00:01:00 0 days 00:00:00
8 2 2021-11-23 11:42:00 1 0 days 00:02:00 0 days 00:00:00
9 2 2021-11-23 11:43:00 0 0 days 00:02:00 0 days 00:00:00
10 2 2021-11-23 11:44:00 0 0 days 00:02:00 0 days 00:01:00
11 2 2021-11-23 11:48:00 0 0 days 00:02:00 0 days 00:05:00
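If it helps, the same steps can be wrapped into one self-contained function. This is only a sketch of the approach above, not a different algorithm: `time_spent`, `max_gap_seconds` and the temporary `_session` column are names made up for this example, and it assumes `wifi` is 1 for wifi and 0 for cellular. It uses `.dt.total_seconds()`, since `.dt.seconds` returns only the seconds component of a timedelta and ignores whole days.

```python
import pandas as pd


def time_spent(df, max_gap_seconds=300):
    """Cumulative wifi/cellular time per user (hypothetical helper)."""
    out = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

    # Gap in seconds to the same user's previous row (NaN for the first row).
    gap = out.groupby("user_id")["timestamp"].diff().dt.total_seconds()
    # A new session starts whenever that gap exceeds the threshold.
    out["_session"] = gap.gt(max_gap_seconds).cumsum()

    # Time since the previous row with the same user, connection and session.
    step = (
        out.groupby(["user_id", "wifi", "_session"])["timestamp"]
        .diff()
        .fillna(pd.Timedelta(0))
    )

    # Split the steps by connection type, then accumulate per user.
    zero = pd.Timedelta(0)
    is_wifi = out["wifi"].eq(1)
    out["wifi_time_spent"] = step.where(is_wifi, zero).groupby(out["user_id"]).cumsum()
    out["cell_time_spent"] = step.where(~is_wifi, zero).groupby(out["user_id"]).cumsum()
    return out.drop(columns="_session")


# Sample data from the question, written as minute offsets from 11:00.
minutes = [0, 1, 2, 10, 11, 22, 40, 41, 42, 43, 44, 48]
df = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "timestamp": pd.Timestamp("2021-11-23 11:00:00") + pd.to_timedelta(minutes, unit="min"),
    "wifi": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})
result = time_spent(df)
```

Keeping the threshold as a parameter makes it easy to try other session gaps. Use `.ge(...)` instead of `.gt(...)` if a gap of exactly 5 minutes should also break the session.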