Pandas - count number of continuous rows by time delta and count
I have the following DataFrame:
id timestamp
0 1 2020-09-01 15:14:35
1 1 2020-09-01 15:15:40
2 1 2020-09-01 15:16:59
3 1 2020-09-01 15:24:42
4 1 2020-09-01 15:25:50
5 1 2020-09-01 15:26:40
6 2 2020-09-01 18:14:35
7 2 2020-09-01 18:17:39
8 2 2020-09-01 18:24:40
9 2 2020-09-01 18:24:42
10 2 2020-09-01 18:34:40
11 2 2020-09-01 18:35:40
12 2 2020-09-01 18:36:40
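For anyone who wants to reproduce this, the frame above could be rebuilt roughly as follows. This is only a sketch: the values are copied from the listing, and it assumes `timestamp` should be a real datetime column rather than strings.

```python
import pandas as pd

# Rebuild the sample frame; values are copied from the listing above.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 7,
    "timestamp": pd.to_datetime([
        "2020-09-01 15:14:35", "2020-09-01 15:15:40", "2020-09-01 15:16:59",
        "2020-09-01 15:24:42", "2020-09-01 15:25:50", "2020-09-01 15:26:40",
        "2020-09-01 18:14:35", "2020-09-01 18:17:39", "2020-09-01 18:24:40",
        "2020-09-01 18:24:42", "2020-09-01 18:34:40", "2020-09-01 18:35:40",
        "2020-09-01 18:36:40",
    ]),
})

# Assumption: timestamp is parsed to datetime64 so time arithmetic works.
print(df.dtypes)
```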
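Each id is a server endpoint, and timestamp is the time of a single request. Plotted on a timeline:
[Figure: timeline plot of the requests]
I want to count the number of load periods for each server, where I define a load period as:
At least 3 requests with a time difference of less than 5 minutes between them.
So server 1 has 2 loads, while server 2 has only 1 load. I would like the output to look like this: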
id timestamp loads_detected
0 1 2020-09-01 15:14:35 0
1 1 2020-09-01 15:15:40 0
2 1 2020-09-01 15:16:59 1 <-- 3 requests in a row less than 5 minutes apart
3 1 2020-09-01 15:24:42 1 <-- this request comes more than 5 minutes after the previous one
4 1 2020-09-01 15:25:50 1
5 1 2020-09-01 15:26:40 2 <-- 3 requests in a row less than 5 minutes apart
6 2 2020-09-01 18:14:35 0
7 2 2020-09-01 18:17:39 0 <-- only 2 requests less than 5 minutes apart, counter not increased
8 2 2020-09-01 18:24:40 0
9 2 2020-09-01 18:24:42 0
10 2 2020-09-01 18:34:40 0
11 2 2020-09-01 18:35:40 0
12 2 2020-09-01 18:36:40 1 <-- 3 requests in a row less than 5 minutes apart
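Any help would be appreciated :)
IIUC, you can group by id and a 5-minute frequency, flag the row where the third request in a group occurs, and then take the cumsum of that flag per id: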
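import pandas as pd
# Flag the third request inside each (id, 5-minute bin) group
df['loads_detected'] = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).cumcount().eq(2)
# Running total of flagged rows per id (select the column so timestamp is not summed)
df['loads_detected'] = df.groupby('id')['loads_detected'].cumsum()
print(df)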
Output:
id timestamp loads_detected
0 1 2020-09-01 15:14:35 0
1 1 2020-09-01 15:15:40 0
2 1 2020-09-01 15:16:59 1
3 1 2020-09-01 15:24:42 1
4 1 2020-09-01 15:25:50 1
5 1 2020-09-01 15:26:40 2
6 2 2020-09-01 18:14:35 0
7 2 2020-09-01 18:17:39 0
8 2 2020-09-01 18:24:40 0
9 2 2020-09-01 18:24:42 0
10 2 2020-09-01 18:34:40 0
11 2 2020-09-01 18:35:40 0
12 2 2020-09-01 18:36:40 1
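One thing to keep in mind with the grouped approach is that it uses fixed 5-minute bins anchored at the first timestamp, so a burst that straddles a bin boundary could be split in two. If the definition is read instead as "a run of at least 3 requests where each one follows the previous by less than 5 minutes", a gap-based variant might look like the sketch below. `count_loads`, `max_gap` and `min_requests` are hypothetical names introduced here, and that reading of the definition is an assumption, not something the question confirms.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 6 + [2] * 7,
    "timestamp": pd.to_datetime([
        "2020-09-01 15:14:35", "2020-09-01 15:15:40", "2020-09-01 15:16:59",
        "2020-09-01 15:24:42", "2020-09-01 15:25:50", "2020-09-01 15:26:40",
        "2020-09-01 18:14:35", "2020-09-01 18:17:39", "2020-09-01 18:24:40",
        "2020-09-01 18:24:42", "2020-09-01 18:34:40", "2020-09-01 18:35:40",
        "2020-09-01 18:36:40",
    ]),
})


def count_loads(frame, max_gap="5min", min_requests=3):
    """Hypothetical helper: running count of request bursts per id."""
    frame = frame.sort_values(["id", "timestamp"])
    # Time since the previous request of the same id (NaT on each id's first row).
    gap = frame.groupby("id")["timestamp"].diff()
    # A new run starts on the first row of an id or after a gap of max_gap or more.
    new_run = gap.isna() | (gap >= pd.Timedelta(max_gap))
    run_id = new_run.cumsum()
    # Zero-based position of each request inside its run.
    pos = frame.groupby(run_id).cumcount()
    # A run counts as one load at the moment it reaches min_requests requests.
    hit = pos.eq(min_requests - 1).astype(int)
    return hit.groupby(frame["id"]).cumsum()


df["loads_detected"] = count_loads(df)
print(df)
```

On the sample data this produces the same loads_detected column as the output above. The two approaches can disagree on other data, so which one fits depends on how a load period is meant to be defined.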