SQL 30-day active user query

I have a table of users and the number of events each one fired on a given date:

DATE USERID EVENTS
2021-08-27 1 5
2021-07-25 1 7
2021-07-23 2 3
2021-07-20 3 9
2021-06-22 1 9
2021-05-05 1 4
2021-05-05 2 2
2021-05-05 3 6
2021-05-05 4 8
2021-05-05 5 1

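I want to create a table showing the number of active users for each date, where an active user is defined as someone who fired an event on the given date or on any of the preceding 30 days.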

DATE ACTIVE_USERS
2021-08-27 1
2021-07-25 3
2021-07-23 2
2021-07-20 2
2021-06-22 1
2021-05-05 5

I tried the following query, but it only returns the users who were active on the specified date itself:

SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;

I also tried using a window function, but I seem to end up with the same result:

SELECT
    DATE,
    SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
    DATE,
    CASE
        WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
        ELSE 0
    END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1

I am using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.

This is tricky as a window function, because count(distinct) is not allowed in that context. You can use a self-join instead:

select t1.date, count(distinct t2.userid) as active_users
from table t1 join
     table t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date;
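To check the self-join against the sample data without a Snowflake account, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The table name `user_events` and the function `active_users_self_join` are made up for illustration, and SQLite's `date(x, '-30 days')` is swapped in for `x - interval '30 day'`, so treat it as a check of the logic rather than of Snowflake syntax.

```python
import sqlite3

# Sample rows from the question: (date, userid, events).
ROWS = [
    ("2021-08-27", 1, 5),
    ("2021-07-25", 1, 7),
    ("2021-07-23", 2, 3),
    ("2021-07-20", 3, 9),
    ("2021-06-22", 1, 9),
    ("2021-05-05", 1, 4),
    ("2021-05-05", 2, 2),
    ("2021-05-05", 3, 6),
    ("2021-05-05", 4, 8),
    ("2021-05-05", 5, 1),
]

# The self-join in SQLite syntax. ISO-8601 date strings compare
# correctly as text, so the range condition works on TEXT columns.
SELF_JOIN = """
select t1.date, count(distinct t2.userid) as active_users
from user_events t1 join
     user_events t2
     on t2.date <= t1.date and
        t2.date > date(t1.date, '-30 days')
group by t1.date
order by t1.date desc
"""


def active_users_self_join(rows):
    """Load rows into a throwaway SQLite table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_events (date text, userid integer, events integer)"
    )
    conn.executemany("insert into user_events values (?, ?, ?)", rows)
    result = conn.execute(SELF_JOIN).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, active_users in active_users_self_join(ROWS):
        print(day, active_users)
```

On the sample rows this reproduces the ACTIVE_USERS column from the question. Note that the `>` comparison makes the window the given date plus the 29 days before it; use `>=` if an event exactly 30 days earlier should also count.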

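However, the self-join can be expensive. One alternative is to "unpivot" the data. That is, for each user, record a +1 on the date they go "in" to the active state and a -1 on the date they go "out" of it (30 days later), and then take a running sum: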

with d as (  -- calculate the dates with "ins" and "outs"
      select userid, date, +1 as inc
      from table
      union all
      select userid, date + interval '30 day', -1 as inc
      from table
     ),
     d2 as (  -- accumulate to get the net actives per day
      select date, userid, sum(inc) as change_on_day,
             sum(sum(inc)) over (partition by userid order by date) as running_inc
      from d
      group by date, userid
     ),
     d3 as (  -- summarize into active periods [start_date, end_date)
      select userid, min(date) as start_date, max(date) as end_date
      from (select d2.*,
                   -- number of periods closed before this row; the row where
                   -- running_inc returns to 0 stays in the period it closes
                   sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date) -
                   (case when running_inc = 0 then 1 else 0 end) as active_period
            from d2
           ) d2
      group by userid, active_period
     )
select d.date, count(d3.userid) as active_users
from (select distinct date from table) d left join
     d3
     on d.date >= d3.start_date and d.date < d3.end_date
group by d.date
order by d.date desc;
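The in/out counting idea may be easier to follow outside SQL. The sketch below is a plain-Python illustration of the same technique, not a translation of the query: the names `active_periods`, `count_active`, and `EVENTS_BY_USER` are invented here, and the 30-day window is half-open, matching `date + interval '30 day'` above.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_periods(event_dates, window_days=30):
    """Return half-open [start, end) periods during which one user is active."""
    changes = defaultdict(int)
    for d in event_dates:
        changes[d] += 1                                # user goes "in"
        changes[d + timedelta(days=window_days)] -= 1  # user goes "out"

    periods, running, start = [], 0, None
    for d in sorted(changes):
        if running == 0 and changes[d] > 0:
            start = d                    # a new active period opens
        running += changes[d]            # running sum of ins and outs
        if running == 0 and start is not None:
            periods.append((start, d))   # the period closes on this date
            start = None
    return periods


def count_active(events_by_user, report_dates):
    """Count users with a period covering each report date."""
    periods = {u: active_periods(ds) for u, ds in events_by_user.items()}
    return {
        day: sum(any(s <= day < e for s, e in ps) for ps in periods.values())
        for day in report_dates
    }


# Event dates per user, taken from the sample table in the question.
EVENTS_BY_USER = {
    1: [date(2021, 5, 5), date(2021, 6, 22), date(2021, 7, 25), date(2021, 8, 27)],
    2: [date(2021, 5, 5), date(2021, 7, 23)],
    3: [date(2021, 5, 5), date(2021, 7, 20)],
    4: [date(2021, 5, 5)],
    5: [date(2021, 5, 5)],
}

if __name__ == "__main__":
    report_dates = sorted({d for ds in EVENTS_BY_USER.values() for d in ds})
    for day, n in count_active(EVENTS_BY_USER, report_dates).items():
        print(day, n)
```

Each user contributes a handful of disjoint periods, so the final step is a range lookup against a small set of intervals rather than a join of every row against every other row.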