PostgreSQL: count max number of concurrent user sessions per hour

Situation

We have a PostgreSQL 9.1 database containing user sessions, with a login date/time and a logout date/time per row. The table looks like this:

 user_id |      login_ts       |      logout_ts
---------+---------------------+---------------------
 USER1   | 2021-02-03 09:23:00 | 2021-02-03 11:44:00
 USER2   | 2021-02-03 10:49:00 | 2021-02-03 13:30:00
 USER3   | 2021-02-03 13:32:00 | 2021-02-03 15:31:00
 USER4   | 2021-02-04 13:50:00 | 2021-02-04 14:53:00
 USER5   | 2021-02-04 14:44:00 | 2021-02-04 15:21:00
 USER6   | 2021-02-04 14:52:00 | 2021-02-04 17:59:00

Goal

I would like to get the maximum number of concurrent users for each of the 24 hours of every day in a time range. Like this:

date       | hour  | sessions
-----------+-------+-----------
2021-02-03 | 01:00 | 0
2021-02-03 | 02:00 | 0
2021-02-03 | 03:00 | 0
2021-02-03 | 04:00 | 0
2021-02-03 | 05:00 | 0
2021-02-03 | 06:00 | 0
2021-02-03 | 07:00 | 0
2021-02-03 | 08:00 | 0
2021-02-03 | 09:00 | 1
2021-02-03 | 10:00 | 2
2021-02-03 | 11:00 | 2
2021-02-03 | 12:00 | 1
2021-02-03 | 13:00 | 1
2021-02-03 | 14:00 | 1
2021-02-03 | 15:00 | 0
2021-02-03 | 16:00 | 0
2021-02-03 | 17:00 | 0
2021-02-03 | 18:00 | 0
2021-02-03 | 19:00 | 0
2021-02-03 | 20:00 | 0
2021-02-03 | 21:00 | 0
2021-02-03 | 22:00 | 0
2021-02-03 | 23:00 | 0
2021-02-03 | 24:00 | 0
2021-02-04 | 01:00 | 0
2021-02-04 | 02:00 | 0
2021-02-04 | 03:00 | 0
2021-02-04 | 04:00 | 0
2021-02-04 | 05:00 | 0
2021-02-04 | 06:00 | 0
2021-02-04 | 07:00 | 0
2021-02-04 | 08:00 | 0
2021-02-04 | 09:00 | 0
2021-02-04 | 10:00 | 0
2021-02-04 | 11:00 | 0
2021-02-04 | 12:00 | 0
2021-02-04 | 13:00 | 1
2021-02-04 | 14:00 | 3
2021-02-04 | 15:00 | 1
2021-02-04 | 16:00 | 1
2021-02-04 | 17:00 | 1
2021-02-04 | 18:00 | 0
2021-02-04 | 19:00 | 0
2021-02-04 | 20:00 | 0
2021-02-04 | 21:00 | 0
2021-02-04 | 22:00 | 0
2021-02-04 | 23:00 | 0
2021-02-04 | 24:00 | 0

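Notes

Similar question

A similar question was answered here by Erwin Brandstetter. However, that answer works per day rather than per hour, and I am clearly too new to PostgreSQL to convert it to hourly, so I hope someone can help.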

For any time period, you can count concurrent sessions using the SQL OVERLAPS operator:

CREATE TEMP TABLE sessions (
  user_id text not null,
  login_ts timestamp,
  logout_ts timestamp );

INSERT INTO sessions SELECT 'webuser', d,
  d+((1+random()*300)::text||' seconds')::interval
FROM generate_series(
  '2021-02-28 07:42'::timestamp,
  '2021-03-01 07:42'::timestamp,
  '5 seconds'::interval) AS d;

SELECT s1.user_id, s1.login_ts, s1.logout_ts, 
(select count(*) FROM sessions s2 
 WHERE (s2.login_ts, s2.logout_ts) OVERLAPS (s1.login_ts, s1.logout_ts)) 
 AS parallel_sessions
FROM sessions s1 LIMIT 10;

 user_id |      login_ts       |         logout_ts          | parallel_sessions
---------+---------------------+----------------------------+------------------
 webuser | 2021-02-28 07:42:00 | 2021-02-28 07:42:25.528594 |                6
 webuser | 2021-02-28 07:42:05 | 2021-02-28 07:45:50.513769 |               47
 webuser | 2021-02-28 07:42:10 | 2021-02-28 07:44:18.810066 |               28
 webuser | 2021-02-28 07:42:15 | 2021-02-28 07:45:17.3888   |               40
 webuser | 2021-02-28 07:42:20 | 2021-02-28 07:43:14.325476 |               15
 webuser | 2021-02-28 07:42:25 | 2021-02-28 07:43:44.174841 |               21
 webuser | 2021-02-28 07:42:30 | 2021-02-28 07:43:32.679052 |               18
 webuser | 2021-02-28 07:42:35 | 2021-02-28 07:45:12.554117 |               38
 webuser | 2021-02-28 07:42:40 | 2021-02-28 07:46:37.94311  |               55
 webuser | 2021-02-28 07:42:45 | 2021-02-28 07:43:08.398444 |               13
(10 rows)
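
When reading these counts, it may help to check how OVERLAPS treats boundaries. The sketch below uses made-up literal timestamps purely for illustration: each period is treated as half-open (start <= time < end), so two sessions that only touch at an endpoint are not counted as overlapping. Note also that in the query above every session overlaps itself, so parallel_sessions is always at least 1, and the count covers every session that overlaps at any point during the session, not only those active at a single instant.

```sql
-- Illustrative timestamps only; they are not taken from the sessions table.
SELECT
  -- The periods share only the endpoint 10:00, so they do not overlap.
  (timestamp '2021-02-03 09:00', timestamp '2021-02-03 10:00')
    OVERLAPS
  (timestamp '2021-02-03 10:00', timestamp '2021-02-03 11:00') AS touching,  -- false
  -- One minute of shared time is enough to overlap.
  (timestamp '2021-02-03 09:00', timestamp '2021-02-03 10:01')
    OVERLAPS
  (timestamp '2021-02-03 10:00', timestamp '2021-02-03 11:00') AS crossing;  -- true
```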

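This works for small data sets, but for better performance use PostgreSQL range types, as shown below. Range types require PostgreSQL 9.2 or later, so they are not available on the 9.1 server mentioned in the question.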

ALTER TABLE sessions ADD timerange tsrange;
UPDATE sessions SET timerange = tsrange(login_ts,logout_ts);
CREATE INDEX ON sessions USING gist (timerange);

CREATE TEMP TABLE level1 AS
SELECT s1.user_id, s1.login_ts, s1.logout_ts,
(select count(*) FROM sessions s2 
 WHERE s2.timerange && s1.timerange) AS parallel_sessions
FROM sessions s1;

SELECT date_trunc('hour',login_ts) AS hour, count(*),
max(parallel_sessions)
FROM level1
GROUP BY hour;
        hour         | count | max 
---------------------+-------+-----
 2021-02-28 14:00:00 |   720 |  98
 2021-03-01 03:00:00 |   720 |  99
 2021-03-01 06:00:00 |   720 |  94
 2021-02-28 09:00:00 |   720 |  96
 2021-02-28 10:00:00 |   720 |  97
 2021-02-28 18:00:00 |   720 |  94
 2021-02-28 11:00:00 |   720 |  97
 2021-03-01 00:00:00 |   720 |  97
 2021-02-28 19:00:00 |   720 |  99
 2021-02-28 16:00:00 |   720 |  94
 2021-02-28 17:00:00 |   720 |  95
 2021-03-01 02:00:00 |   720 |  99
 2021-02-28 08:00:00 |   720 |  96
 2021-02-28 23:00:00 |   720 |  94
 2021-03-01 07:00:00 |   505 |  92
 2021-03-01 04:00:00 |   720 |  95
 2021-02-28 21:00:00 |   720 |  97
 2021-03-01 01:00:00 |   720 |  93
 2021-02-28 22:00:00 |   720 |  96
 2021-03-01 05:00:00 |   720 |  93
 2021-02-28 20:00:00 |   720 |  95
 2021-02-28 13:00:00 |   720 |  95
 2021-02-28 12:00:00 |   720 |  97
 2021-02-28 15:00:00 |   720 |  98
 2021-02-28 07:00:00 |   216 |  93
(25 rows)
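
As a small sanity check on the range operators used above, here is a sketch with made-up timestamps. The two-argument tsrange() constructor defaults to '[)' bounds (lower inclusive, upper exclusive), so && behaves like OVERLAPS for sessions that merely touch. The @> operator tests whether a range contains a single timestamp, which is one way to ask whether a session was active at a given instant.

```sql
-- Illustrative timestamps only; they are not taken from the sessions table.
SELECT
  -- Ranges sharing only the endpoint 10:00 do not overlap under '[)' bounds.
  tsrange(timestamp '2021-02-03 09:00', timestamp '2021-02-03 10:00')
    && tsrange(timestamp '2021-02-03 10:00', timestamp '2021-02-03 11:00') AS touching,  -- false
  -- 09:30 lies inside the range 09:00 to 10:00.
  tsrange(timestamp '2021-02-03 09:00', timestamp '2021-02-03 10:00')
    @> timestamp '2021-02-03 09:30' AS contains_instant;                                -- true
```

One thing to watch for: tsrange() raises an error when the lower bound is greater than the upper bound, so the UPDATE above would fail on any row whose logout_ts is earlier than its login_ts.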

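I would break this into two problems:

  1. Find the number of overlaps and when they begin and end.
  2. Find the hours.

Note two things:

  • I assume '2014-04-03 17:59:00' is a typo.
  • The following puts the date/hour in a single column, at the start of the hour.

First, calculate the overlaps. To do this, unpivot the logins and logouts. Assign a counter of +1 to each login and -1 to each logout, then take a cumulative sum. This looks like: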

with overlap as (
      select v.ts, sum(v.inc) as inc,
             sum(sum(v.inc)) over (order by v.ts) as num_overlaps,
             lead(v.ts) over (order by v.ts) as next_ts
      from sessions s cross join lateral
           (values (login_ts, 1), (logout_ts, -1)) v(ts, inc)
      group by v.ts
     )
select *
from overlap
order by ts;
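
The query above relies on LATERAL, which was added in PostgreSQL 9.3, while the question mentions 9.1. As a sketch only, the same unpivot can be written with UNION ALL, which should need nothing newer than window functions. The table name demo_sessions is hypothetical and holds the first three sample rows from the question so the running count is easy to follow.

```sql
-- Hypothetical demo table with the first three sample rows from the question.
CREATE TEMP TABLE demo_sessions (user_id text, login_ts timestamp, logout_ts timestamp);
INSERT INTO demo_sessions VALUES
  ('USER1', '2021-02-03 09:23', '2021-02-03 11:44'),
  ('USER2', '2021-02-03 10:49', '2021-02-03 13:30'),
  ('USER3', '2021-02-03 13:32', '2021-02-03 15:31');

-- Unpivot with UNION ALL instead of LATERAL, then take a running sum of +1/-1.
SELECT v.ts,
       sum(v.inc)                           AS inc,
       sum(sum(v.inc)) OVER (ORDER BY v.ts) AS num_overlaps,
       lead(v.ts)      OVER (ORDER BY v.ts) AS next_ts
FROM (
      SELECT login_ts  AS ts,  1 AS inc FROM demo_sessions
      UNION ALL
      SELECT logout_ts AS ts, -1 AS inc FROM demo_sessions
     ) v
GROUP BY v.ts
ORDER BY v.ts;
```

For these three rows the events fall at 09:23, 10:49, 11:44, 13:30, 13:32 and 15:31, and num_overlaps runs 1, 2, 1, 0, 1, 0. Each row therefore describes the interval from ts to next_ts during which that many sessions were open.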

For the next step, use generate_series() to generate timestamps one hour apart. Then use left join and group by to find the maximum in each period:

with overlap as (
      select v.ts, sum(v.inc) as inc,
             sum(sum(v.inc)) over (order by v.ts) as num_overlaps,
             lead(v.ts) over (order by v.ts) as next_ts
      from sessions s cross join lateral
           (values (login_ts, 1), (logout_ts, -1)) v(ts, inc)
      group by v.ts
     )
select gs.hh, coalesce(max(o.num_overlaps), 0) as num_overlaps
from generate_series('2021-02-03'::date, '2021-02-05'::date, interval '1 hour') gs(hh) left join
     overlap o
     on o.ts < gs.hh + interval '1 hour' and
        o.next_ts > gs.hh
group by gs.hh
order by gs.hh;
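
The question asked for separate date and hour columns, whereas the query above returns a single timestamp. One possible way to split it, shown here as a sketch, is a cast to date plus to_char(). The CTE name demo_sessions and its inline rows (the second day of the sample data) are assumptions made so the block runs on its own, and UNION ALL stands in for LATERAL. Hours are labelled by their start, 00:00 through 23:00, rather than 01:00 through 24:00 as in the question.

```sql
WITH demo_sessions(user_id, login_ts, logout_ts) AS (
      -- Hypothetical inline copy of the question's sample rows for 2021-02-04.
      VALUES ('USER4', timestamp '2021-02-04 13:50', timestamp '2021-02-04 14:53'),
             ('USER5', timestamp '2021-02-04 14:44', timestamp '2021-02-04 15:21'),
             ('USER6', timestamp '2021-02-04 14:52', timestamp '2021-02-04 17:59')
     ),
     overlap AS (
      -- Running count of open sessions between consecutive login/logout events.
      SELECT v.ts,
             sum(sum(v.inc)) OVER (ORDER BY v.ts) AS num_overlaps,
             lead(v.ts)      OVER (ORDER BY v.ts) AS next_ts
      FROM (SELECT login_ts AS ts, 1 AS inc FROM demo_sessions
            UNION ALL
            SELECT logout_ts, -1 FROM demo_sessions) v
      GROUP BY v.ts
     )
SELECT gs.hh::date                      AS date,
       to_char(gs.hh, 'HH24:MI')        AS hour,
       coalesce(max(o.num_overlaps), 0) AS sessions
FROM generate_series(timestamp '2021-02-04 12:00',
                     timestamp '2021-02-04 18:00',
                     interval '1 hour') AS gs(hh)
LEFT JOIN overlap o
       ON o.ts < gs.hh + interval '1 hour'   -- interval starts before the hour ends
      AND o.next_ts > gs.hh                  -- and ends after the hour starts
GROUP BY gs.hh
ORDER BY gs.hh;
```

For the hours 12:00 through 18:00 this yields 0, 1, 3, 2, 1, 1, 0. The 15:00 row is 2 because USER5 and USER6 are both still logged in until 15:21.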

Here is a db<>fiddle that uses your data, with the last record's logout time fixed to a reasonable value.