SQL 30 day active user query
I have a table of users and how many events they fired on a given date:
DATE | USERID | EVENTS |
---|---|---|
2021-08-27 | 1 | 5 |
2021-07-25 | 1 | 7 |
2021-07-23 | 2 | 3 |
2021-07-20 | 3 | 9 |
2021-06-22 | 1 | 9 |
2021-05-05 | 1 | 4 |
2021-05-05 | 2 | 2 |
2021-05-05 | 3 | 6 |
2021-05-05 | 4 | 8 |
2021-05-05 | 5 | 1 |
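I want to create a table showing the number of active users for each date, where an active user is defined as someone who fired an event on the given date or on any of the preceding 30 days.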
DATE | ACTIVE_USERS |
---|---|
2021-08-27 | 1 |
2021-07-25 | 3 |
2021-07-23 | 2 |
2021-07-20 | 2 |
2021-06-22 | 1 |
2021-05-05 | 5 |
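To make the definition concrete, here is a minimal brute-force sketch in Python that recomputes the expected table from the sample rows. The names `ROWS` and `active_users_by_date` are made up for illustration, and the window is assumed to be `(day - 30 days, day]`; the sample data has no row sitting exactly on the 30-day boundary, so it cannot tell an inclusive lower bound from an exclusive one.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, userid, events).
ROWS = [
    (date(2021, 8, 27), 1, 5),
    (date(2021, 7, 25), 1, 7),
    (date(2021, 7, 23), 2, 3),
    (date(2021, 7, 20), 3, 9),
    (date(2021, 6, 22), 1, 9),
    (date(2021, 5, 5), 1, 4),
    (date(2021, 5, 5), 2, 2),
    (date(2021, 5, 5), 3, 6),
    (date(2021, 5, 5), 4, 8),
    (date(2021, 5, 5), 5, 1),
]


def active_users_by_date(rows, window_days=30):
    """For each date present in rows, count distinct users seen in the trailing window."""
    result = {}
    for day in sorted({d for d, _, _ in rows}, reverse=True):
        window_start = day - timedelta(days=window_days)
        # Assumption: the lower bound is exclusive, the given day itself is included.
        users = {u for d, u, _ in rows if window_start < d <= day}
        result[day] = len(users)
    return result


counts = active_users_by_date(ROWS)
for day, n in counts.items():
    print(day.isoformat(), n)
```

This compares every date against every row, so it is only a reference for checking a SQL solution on small data, not something to run at scale.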
I tried the following query, which returned only the users who were active on the specified date itself:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function, but it seems to end up with the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with a window function, because count(distinct) is not allowed there. You can use a self-join instead:
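-- t1 supplies the reporting dates, t2 the events inside each trailing 30-day window
select t1.date, count(distinct t2.userid) as active_users
from table t1 join
     table t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date;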
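As a quick sanity check, the self-join can be run against the sample data with Python's built-in `sqlite3` module. This is only a sketch under a few assumptions: the table is called `user_events` here (a made-up name, since `table` is a reserved word), dates are stored as ISO text, and SQLite's `date(x, '-30 days')` stands in for Snowflake's `x - interval '30 day'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; dates are ISO-8601 strings so text comparison orders them correctly.
conn.execute("CREATE TABLE user_events (date TEXT, userid INTEGER, events INTEGER)")
conn.executemany(
    "INSERT INTO user_events VALUES (?, ?, ?)",
    [
        ("2021-08-27", 1, 5),
        ("2021-07-25", 1, 7),
        ("2021-07-23", 2, 3),
        ("2021-07-20", 3, 9),
        ("2021-06-22", 1, 9),
        ("2021-05-05", 1, 4),
        ("2021-05-05", 2, 2),
        ("2021-05-05", 3, 6),
        ("2021-05-05", 4, 8),
        ("2021-05-05", 5, 1),
    ],
)

SELF_JOIN = """
SELECT t1.date, COUNT(DISTINCT t2.userid) AS active_users
FROM user_events AS t1
JOIN user_events AS t2
  ON t2.date <= t1.date
 AND t2.date > date(t1.date, '-30 days')  -- SQLite spelling of the 30-day interval
GROUP BY t1.date
ORDER BY t1.date DESC
"""

result = conn.execute(SELF_JOIN).fetchall()
for day, active_users in result:
    print(day, active_users)
```

On the sample rows this reproduces the expected table from the question. The join pairs every row with every row in its trailing window, which is where the cost comes from on large tables.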
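However, that can be expensive. One alternative is to "unpivot" the data. That is, record an incremental count each time a user goes "in" to and "out" of the active state, and then take a cumulative sum: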
with d as ( -- one +1 row when a user becomes active, one -1 row 30 days later
      select userid, date, 1 as inc
      from table
      union all
      select userid, date + interval '30 day', -1 as inc
      from table
     ),
     d2 as ( -- net change per user and day, plus the running count of open 30-day windows
      select date, userid, sum(inc) as change_on_day,
             sum(sum(inc)) over (partition by userid order by date) as running_inc
      from d
      group by date, userid
     ),
     d3 as ( -- collapse into active periods; a period closes on the row where running_inc returns to 0
      select userid, min(date) as start_date, max(date) as end_date
      from (select d2.*,
                   -- number of closed periods before this row, so the closing row stays in its own period
                   coalesce(sum(case when running_inc = 0 then 1 else 0 end)
                            over (partition by userid order by date
                                  rows between unbounded preceding and 1 preceding), 0) as active_period
            from d2
           ) p
      group by userid, active_period
     )
select d.date, count(d3.userid) as active_users
from (select distinct date from table) d left join
     d3
     on d.date >= d3.start_date and d.date < d3.end_date
group by d.date;
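The in/out bookkeeping is easier to follow outside SQL. Below is a minimal Python sketch of the same idea, not a translation of the query: each event adds +1 on its date and -1 thirty days later, a running sum per user marks when the user is active, and the stretches where the sum is positive become half-open periods `[start, end)`. The names `EVENTS`, `active_periods` and `active_users_by_date` are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# (date, userid) pairs from the sample table; the event counts are not needed here.
EVENTS = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]


def active_periods(events, window_days=30):
    """Return {userid: [(start, end), ...]} with end exclusive; overlapping windows are merged."""
    deltas = defaultdict(lambda: defaultdict(int))
    for day, user in events:
        deltas[user][day] += 1                                 # user goes "in"
        deltas[user][day + timedelta(days=window_days)] -= 1   # and "out" 30 days later
    periods = {}
    for user, changes in deltas.items():
        spans, running, start = [], 0, None
        for day in sorted(changes):
            previous = running
            running += changes[day]
            if previous == 0 and running > 0:
                start = day                 # a new active period opens
            elif previous > 0 and running == 0:
                spans.append((start, day))  # the period closes on this day
        periods[user] = spans
    return periods


def active_users_by_date(events, window_days=30):
    """Count, for each event date, the users whose active period covers that date."""
    periods = active_periods(events, window_days)
    days = sorted({d for d, _ in events}, reverse=True)
    return {
        day: sum(any(start <= day < end for start, end in spans) for spans in periods.values())
        for day in days
    }


periods = active_periods(EVENTS)
counts = active_users_by_date(EVENTS)
for day, n in counts.items():
    print(day.isoformat(), n)
```

For user 1 the first period comes out as 2021-05-05 up to (but not including) 2021-06-04. The benefit over the self-join is that each user contributes a handful of periods rather than one joined row per event per date.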