Get list of active users per day

I have a dataset containing the list of users connected to a server, sampled every 15 minutes, e.g.

May 7, 2020, 8:09 AM   user1
May 7, 2020, 8:09 AM   user2
...
May 7, 2020, 8:24 AM   user1
May 7, 2020, 8:24 AM   user3
... 

and I would like to get the number of active users for each day, e.g.

May 7, 2020   71
May 8, 2020   83

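Now, the tricky part. A user is defined as active if he/she has been connected 80% of the time or more over the last 7 days. Since a week has 672 15-minute intervals (1440 / 15 x 7), a user has to show up at least 538 times (672 x 0.8 = 537.6, rounded up).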
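As a quick check of that arithmetic, here is a minimal Python sketch. The constant names are only illustrative and do not come from the dataset.

```python
import math

# Illustrative constants (assumed names, taken from the definition above)
MINUTES_PER_DAY = 24 * 60   # 1440
SAMPLE_MINUTES = 15         # one sample every 15 minutes
WINDOW_DAYS = 7             # look-back window, including the current day
ACTIVE_SHARE = 0.8          # required share of connected samples

# 96 samples per day, times 7 days
intervals_per_week = MINUTES_PER_DAY // SAMPLE_MINUTES * WINDOW_DAYS

# 80% of the samples, rounded up to a whole number of samples
min_samples = math.ceil(intervals_per_week * ACTIVE_SHARE)

print(intervals_per_week, min_samples)
```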

My code so far is:

SELECT
    DATE_TRUNC('week', ts) AS ts_week
    ,COUNT(DISTINCT user)
FROM activeusers
GROUP BY 1

That only gives the number of distinct users connected per week.

July 13, 2020, 12:00 AM   435
July 20, 2020, 12:00 AM   267

But I want to implement the active-user definition and get a result for every day, not just for Mondays.

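The special difficulty here is that users may qualify on days with no connections at all, as long as they had enough connections over the previous 6 days.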

That makes it harder to use a window function. Aggregating in a LATERAL subquery is the obvious alternative:

WITH daily AS (  -- ① granulate daily
   SELECT ts::date AS the_day
        , "user"
        , count(*)::int AS daily_cons
   FROM   activeusers
   GROUP  BY 1, 2
  )
SELECT t.the_day, count("user") AS active_users
FROM  ( --  ② time frame
   SELECT generate_series (timestamp '2020-07-01'
                         , LOCALTIMESTAMP
                         , interval '1 day')::date
   ) t(the_day)
LEFT   JOIN LATERAL (
   SELECT "user"
   FROM   daily d
   WHERE  d.the_day >= t.the_day - 6
   AND    d.the_day <= t.the_day
   GROUP  BY "user"
   HAVING sum(daily_cons) >= 538  -- ③
   ) sum7 ON true
GROUP  BY t.the_day
ORDER  BY t.the_day;

① The CTE daily is optional, but starting with a daily aggregate should help performance a lot.

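② You have to define the time frame somehow. I chose a fixed start date up to the current time; replace it with your own range. To use the total range present in the table, use instead: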

SELECT generate_series (min(the_day)::timestamp
                      , max(the_day)::timestamp
                      , interval '1 day')::date AS the_day
FROM   daily

Consider the basics here:

  • Generating time series between two dates in PostgreSQL

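This also overcomes the "special difficulty" mentioned above.

③ The condition in the HAVING clause excludes all users with too few connections over the last 7 days (including "today").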
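To make the windowing logic concrete, here is a minimal Python sketch of what the LATERAL subquery computes for each calendar day. It is an illustration only, not part of the query: the function name `active_users_per_day` and the sample data are hypothetical.

```python
from datetime import date, timedelta

def active_users_per_day(daily, days, threshold=538, window_days=7):
    """Count users whose connections in [day - 6, day] reach the threshold.

    daily maps (day, user) -> connection count (like the CTE `daily`);
    days is the calendar to report on (like generate_series).
    """
    result = {}
    for day in days:
        start = day - timedelta(days=window_days - 1)
        totals = {}
        for (d, user), n in daily.items():
            if start <= d <= day:          # same range as the WHERE clause
                totals[user] = totals.get(user, 0) + n
        # same filter as the HAVING clause
        result[day] = sum(1 for total in totals.values() if total >= threshold)
    return result

first = date(2020, 7, 1)
# Hypothetical data: user1 fills all 96 slots on each of the first 6 days,
# then does not connect at all.
daily = {(first + timedelta(days=i), "user1"): 96 for i in range(6)}
days = [first + timedelta(days=i) for i in range(8)]
counts = active_users_per_day(daily, days)

for day in days:
    print(day, counts[day])
```

In this made-up data set, user1 still counts as active on the 7th day despite having no connections that day, because the calendar rather than the connection log drives the outer loop.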

Related:

  • Best way to count records by arbitrary time intervals in Rails+Postgres
  • Total Number of Records per Week

Aside:
You wouldn't really use the reserved word "user" as an identifier.

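Since you want active users per day but determined over a week, I think you can use CROSS APPLY to replicate the count for each day. The FROM part of the query gives you the dates and users, and the CROSS APPLY restricts them to active users. You can specify the users or dates you want in the final WHERE.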

SELECT users.UserName, users.LogDate
FROM (
    SELECT UserName, CAST(ts AS DATE) AS LogDate
    FROM activeusers
    GROUP BY UserName, CAST(ts AS DATE)
    ) AS users
CROSS APPLY (
    SELECT UserName, COUNT(1) AS Connections  -- derived-table columns need a name
    FROM activeusers AS a
    WHERE a.UserName = users.UserName
      AND CAST(a.ts AS DATE) BETWEEN DATEADD(DAY, -6, users.LogDate) AND users.LogDate  -- 7 days including LogDate
    GROUP BY UserName
    HAVING COUNT(1) >= 538
    ) AS activeUsers
WHERE users.LogDate > '2020-01-01' AND users.UserName = 'user1'

This is for SQL Server; you may need to adapt it for PostgreSQL. CROSS APPLY corresponds to CROSS JOIN LATERAL (...), while OUTER APPLY corresponds to LEFT JOIN LATERAL (...) ON true.

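I have done something similar for device monitoring reports. I have never come up with a solution that does not involve building a calendar and cross joining it to the distinct list of devices (user values in your case).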

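This deliberately verbose query builds the cross join, gets the activity count for each user and ddate, performs a running sum() over 7 days, and then counts the users on a given ddate that had 538 or more actives in the 7 days ending on that ddate.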

with drange as (
  select min(ts) as start_ts, max(ts) as end_ts
    from activeusers
), alldates as (
  select (start_ts + make_interval(days := x))::date as ddate
    from drange
   cross join generate_series(0, date_part('day', end_ts - start_ts)::int) as gs(x)
), user_dates as (
  select ddate, "user"
    from alldates
   cross join (select distinct "user" from activeusers) u
), user_date_counts as (
  select u.ddate, u."user",
         sum(case when a."user" is null then 0 else 1 end) as actives
    from user_dates u
    left join activeusers a
           on a."user" = u."user"
          and a.ts::date = u.ddate
   group by u.ddate, u."user"
), running_window as (
  select ddate, "user",
         sum(actives) over (partition by "user"
                                order by ddate
                         rows between 6 preceding
                                  and current row) seven_days
    from user_date_counts
), flag_active as (
  select ddate, "user",
         seven_days >= 538 as is_active
    from running_window
)
select ddate, count(*) as active_users
  from flag_active
 where is_active
 group by ddate
;
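The calendar cross join matters because a frame of rows between 6 preceding and current row counts rows, not days. The following minimal Python sketch compares a running sum over sparse rows with one over gap-filled rows. The helper `running_sum_last_7_rows` and the data are hypothetical and only mimic the window frame.

```python
from datetime import date, timedelta

def running_sum_last_7_rows(rows):
    """rows: list of (day, count) sorted by day.

    Mimics sum(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
    """
    counts = [n for _, n in rows]
    return {day: sum(counts[max(0, i - 6): i + 1])
            for i, (day, _) in enumerate(rows)}

first = date(2020, 7, 1)
last_day = first + timedelta(days=13)

# Hypothetical user: 96 connections on days 0..5, nothing on days 6..12,
# then 96 connections again on day 13.
sparse = [(first + timedelta(days=i), 96) for i in range(6)] + [(last_day, 96)]

# Gap-filled version: one row per calendar day, 0 where nothing was logged.
by_day = dict(sparse)
dense = [(first + timedelta(days=i), by_day.get(first + timedelta(days=i), 0))
         for i in range(14)]

sparse_sum = running_sum_last_7_rows(sparse)[last_day]  # 7 rows spanning 14 days
dense_sum = running_sum_last_7_rows(dense)[last_day]    # 7 rows = 7 calendar days

print(sparse_sum, dense_sum)
```

Without the zero rows, the sparse sum reaches back two weeks and would wrongly pass the 538 threshold. With one row per user and day, the frame covers exactly the 7 days ending on ddate.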