Counting preceding occurrences of an event within a given interval for each event row with a window function
I have a table storing events that occur for users, as illustrated in http://sqlfiddle.com/#!15/2b559/2/0
It has the following columns:
event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)
A sample of the data looks like this:
+-----------+----------+-------------+----------------------------+
| event_id | user_id | event_type | timestamp |
+-----------+----------+-------------+----------------------------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+
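For each event, I would like to get the number of events for the same user and the same event_type that occurred within the 30 days before that event (the event itself included).
It should look like this: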
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 3 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 3 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 4 |
+-----------+----------+-------------+-----------------------------+-------+
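The table contains millions of rows, so I cannot use a correlated subquery as suggested by @jpw in the answer below.
So far, I have managed to get the total number of preceding events with the same user_id and the same event_type by using the following query:
SELECT event_id, user_id, event_type, "timestamp",
       COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id, event_type ORDER BY "timestamp"
             ROWS UNBOUNDED PRECEDING);
This gives the following result: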
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 5 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 6 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 7 |
+-----------+----------+-------------+-----------------------------+-------+
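Do you know whether there is a way to change the window frame specification or the COUNT function so that only the number of events that occurred within x days is returned?
Secondly, I would like to exclude duplicate events, i.e. events with the same event_type and the same timestamp.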
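Maybe you already know how to solve this with a subquery and are asking specifically for a solution using window functions, in which case this answer may not be what you are after. But if you are interested in any possible solution, this is easy to solve with a correlated subquery, although I suspect performance might be poor: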
select
    event_id, user_id, event_type, "timestamp",
    (
        select count(distinct timestamp)
        from events
        where timestamp >= e.timestamp - interval '30 days'
          and timestamp <= e.timestamp
          and user_id = e.user_id
          and event_type = e.event_type
        group by event_type, user_id
    ) as "count"
FROM events e
order by event_id;
This is clumsy, but it works. The CTE will probably perform worse than @jpw's correlated count subquery.
WITH ding AS (
    SELECT user_id, event_type, ztimestamp
         , row_number() OVER (PARTITION BY user_id, event_type
                              ORDER BY ztimestamp) AS rnk
    FROM events
    )
SELECT d1.*
     , 1 + d1.rnk - d0.rnk AS diff
FROM ding d1
JOIN ding d0 USING (user_id, event_type)
WHERE d1.ztimestamp >= d0.ztimestamp
  AND d1.ztimestamp < d0.ztimestamp + '30 days'::interval
  AND NOT EXISTS (
      SELECT *
      FROM ding nx
      WHERE nx.user_id = d0.user_id
        AND nx.event_type = d0.event_type
        AND nx.ztimestamp < d0.ztimestamp
        AND nx.ztimestamp > d1.ztimestamp - '30 days'::interval
      )
;
I found a query that works:
SELECT toto.event_id, toto.user_id, toto.event_type, toto.lv AS time, COUNT(*)
FROM (
    SELECT e.event_id, e.user_id, e.event_type, "timestamp",
           last_value("timestamp") OVER w AS lv,
           unnest(array_agg(e."timestamp") OVER w) AS agg
    FROM events e
    WINDOW w AS (PARTITION BY e.user_id, e.event_type ORDER BY e."timestamp"
                 ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP BY event_id, user_id, event_type, lv;
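On my development machine, with a sample of 1,000 rows, it executes in 49 ms. With a sample of 10,000 rows it takes 8277 ms, whereas @jpw's query takes 6720 ms, using an index on the timestamp. With a sample of 50,000 rows, both queries take more than 100 seconds, so I did not test further :)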
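Timings like these can be checked on synthetic data by running a candidate query under EXPLAIN (ANALYZE). What follows is only a rough sketch, assuming PostgreSQL: the table name events_bench, the row count, and the way users, types and timestamps are distributed are all made up for illustration and are not the data measured above.

```sql
-- Hypothetical benchmark table; name, size and distribution are assumptions.
CREATE TEMP TABLE events_bench AS
SELECT g                                              AS event_id,
       (g % 10) + 1                                   AS user_id,
       (g % 3) + 1                                    AS event_type,
       timestamp '2015-01-01' + g * interval '1 hour' AS "timestamp"
FROM generate_series(1, 10000) AS g;

-- Index covering the equality columns first, then the range column.
CREATE INDEX ON events_bench (user_id, event_type, "timestamp");
ANALYZE events_bench;

-- EXPLAIN (ANALYZE) actually executes the query and reports the real run time.
EXPLAIN (ANALYZE, BUFFERS)
SELECT e.event_id,
       (SELECT count(*)
        FROM events_bench x
        WHERE x.user_id = e.user_id
          AND x.event_type = e.event_type
          AND x."timestamp" >= e."timestamp" - interval '30 days'
          AND x."timestamp" <= e."timestamp") AS ct
FROM events_bench e;
```

Swapping the final SELECT for any of the other queries in this thread (with the table and column names adjusted) gives comparable numbers on the same data.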
I provided a more detailed answer, plus a fiddle, under the duplicate question on dba.SE.
Basically:
CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
And:
SELECT *
FROM events e
, LATERAL (
    SELECT count(*) AS ct
    FROM events
    WHERE user_id = e.user_id
      AND event_type = e.event_type
      AND ts >= e.ts - interval '30 days'
      AND ts <= e.ts
    ) ct
ORDER BY event_id;
Or:
SELECT e.*, count(*) AS ct
FROM events e
JOIN events x USING (user_id, event_type)
WHERE x.ts >= e.ts - interval '30 days'
  AND x.ts <= e.ts
GROUP BY e.event_id
ORDER BY e.event_id;
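On the window-frame change the question asks about: since PostgreSQL 11, a RANGE frame accepts an interval offset, so the 30-day limit can be written directly into the window definition. The sketch below assumes PostgreSQL 11 or later. The table definition is reconstructed from the question's sample (the primary key and NOT NULL constraints are assumptions), and the aliases cnt, cnt_distinct, rn and s are only illustrative. The second query is one possible way to count duplicate timestamps once, since count(DISTINCT ...) is not accepted as a window function.

```sql
-- Sample table reconstructed from the question (constraints are assumed).
CREATE TEMP TABLE events (
    event_id    integer PRIMARY KEY,
    user_id     integer NOT NULL,
    event_type  integer NOT NULL,
    "timestamp" timestamp NOT NULL
);

INSERT INTO events VALUES
    (1, 1, 1, '2015-01-01'),
    (2, 1, 1, '2015-01-10'),
    (3, 1, 1, '2015-01-20'),
    (4, 1, 1, '2015-01-30'),
    (5, 1, 1, '2015-02-10'),
    (6, 1, 1, '2015-02-21'),
    (7, 1, 1, '2015-02-22');

-- Events in the 30 days up to and including each event.
SELECT event_id, user_id, event_type, "timestamp",
       count(*) OVER w AS cnt
FROM events
WINDOW w AS (PARTITION BY user_id, event_type
             ORDER BY "timestamp"
             RANGE BETWEEN interval '30 days' PRECEDING AND CURRENT ROW)
ORDER BY event_id;

-- Variant counting rows with the same user, type and timestamp only once:
-- flag one row per duplicate group, then count only the flagged rows.
SELECT event_id, user_id, event_type, "timestamp",
       count(*) FILTER (WHERE rn = 1) OVER w AS cnt_distinct
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY user_id, event_type, "timestamp"
                              ORDER BY event_id) AS rn
    FROM events
) s
WINDOW w AS (PARTITION BY user_id, event_type
             ORDER BY "timestamp"
             RANGE BETWEEN interval '30 days' PRECEDING AND CURRENT ROW)
ORDER BY event_id;
```

On the seven sample rows, both queries return the counts 1, 2, 3, 4, 3, 3, 4, matching the desired output in the question. With a RANGE frame, rows sharing the current row's timestamp are peers and always fall inside the frame, which is why flagging a single row per duplicate group is enough.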