Counting preceding occurrences of an event within a given interval for each event row with a window function

I have a table that stores events occurring for users, as shown at http://sqlfiddle.com/#!15/2b559/2/0, with the following columns:

event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)

A sample of the data looks like this:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+
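
For anyone who wants to reproduce the sample locally, it could be set up roughly as follows. This is only a sketch: the column names and types come from the list above and the table name events from the queries further down, while the primary key and the NOT NULL constraints are assumptions.

```sql
-- Sample table; PRIMARY KEY and NOT NULL are assumptions, not part of the original schema.
CREATE TABLE events (
    event_id    integer PRIMARY KEY,
    user_id     integer   NOT NULL,
    event_type  integer   NOT NULL,
    "timestamp" timestamp NOT NULL
);

-- The seven sample rows shown above.
INSERT INTO events (event_id, user_id, event_type, "timestamp") VALUES
    (1, 1, 1, '2015-01-01 00:00:00'),
    (2, 1, 1, '2015-01-10 00:00:00'),
    (3, 1, 1, '2015-01-20 00:00:00'),
    (4, 1, 1, '2015-01-30 00:00:00'),
    (5, 1, 1, '2015-02-10 00:00:00'),
    (6, 1, 1, '2015-02-21 00:00:00'),
    (7, 1, 1, '2015-02-22 00:00:00');
```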

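For each event, I would like to get the number of events with the same user and the same event_type that occurred within the 30 days before it (the event itself included).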

The result should look like this:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

The table contains millions of rows, so I cannot use a correlated subquery as suggested by @jpw in the answer below.

So far, I have managed to get the total number of preceding events with the same user_id and the same event_type, using the following query:

SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);

This produces the following result:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     5 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     6 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     7 |
+-----------+----------+-------------+-----------------------------+-------+

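Do you know whether there is a way to change the window frame specification or the COUNT function so that only the number of events that occurred within x days is returned?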

Secondly, I would like to exclude duplicate events, i.e. events with the same event_type and the same timestamp.

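Perhaps you already know how to solve this with a subquery and are asking specifically for a solution using window functions, in which case this answer may not apply. But if you are interested in any possible solution, this is easy to solve with a correlated subquery, although I suspect performance might be poor: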

select 
  event_id, user_id,event_type,"timestamp", 
  (
    select count(distinct timestamp) 
    from events 
    where timestamp >= e.timestamp - interval '30 days'
    and timestamp <= e.timestamp
    and user_id = e.user_id 
    and event_type = e.event_type
    group by event_type, user_id
  ) as "count"
FROM events e
order by event_id;

Sample SQL Fiddle

This is clumsy, but it works. The CTE will probably perform worse than @jpw's correlated subquery with a count.

WITH ding AS (
  SELECT user_id, event_type , ztimestamp
        , row_number() OVER( PARTITION BY user_id, event_type
                             ORDER BY ztimestamp) AS rnk
  FROM events
  )
SELECT d1.*
        , 1+ d1.rnk - d0.rnk AS diff
FROM ding d1
JOIN ding d0 USING (user_id,event_type)
WHERE d1.ztimestamp >= d0.ztimestamp
AND d1.ztimestamp < d0.ztimestamp + '30 days'::interval
AND NOT EXISTS (
        SELECT *
        FROM ding nx
        WHERE nx.user_id = d0.user_id
        AND nx.event_type = d0.event_type
        AND nx.ztimestamp < d0.ztimestamp
        AND nx.ztimestamp > d1.ztimestamp - '30 days'::interval
        )
        ;

I found a query that works:

SELECT toto.event_id,toto.user_id,toto.event_type,toto.lv as time,COUNT(*)
FROM(
    SELECT e.event_id, e.user_id,e.event_type,"timestamp",
    last_value("timestamp") OVER w as lv,
    unnest(array_agg(e."timestamp") OVER w) as agg
    FROM events e
    WINDOW w AS (PARTITION BY e.user_id,e.event_type ORDER BY e."timestamp"
    ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP by event_id,user_id,event_type,lv;

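On my development machine, with a sample of 1000 rows, it takes 49 ms to execute. With a sample of 10000 rows it takes 8277 ms, whereas @jpw's query takes 6720 ms using an index on timestamp. With a sample of 50000 rows, both queries take more than 100 seconds, so I did not test beyond that :)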
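
To repeat this kind of comparison, a random sample of a given size could be generated along these lines. This is only a sketch under assumptions: the data behind the timings above is not shown, so the value ranges (100 users, 5 event types, one year of timestamps) are made up for illustration, and the statement expects an empty events table with the columns from the question.

```sql
-- Fill events with 10000 random rows; all value ranges are illustrative assumptions.
INSERT INTO events (event_id, user_id, event_type, "timestamp")
SELECT g,                                                -- sequential event_id
       1 + floor(random() * 100)::int,                   -- user_id between 1 and 100
       1 + floor(random() * 5)::int,                     -- event_type between 1 and 5
       timestamp '2015-01-01' + random() * interval '365 days'
FROM   generate_series(1, 10000) AS g;

-- Refresh planner statistics before timing anything.
ANALYZE events;
```

Prefixing a query with EXPLAIN ANALYZE then reports its actual execution time along with the plan, which also shows whether an index was used.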

SQL Fiddle

I provided a more detailed answer, plus a fiddle, under the duplicate question on dba.SE.

Basically:

CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);

And:

SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;

Or:

SELECT e.*, count(*) AS ct
FROM   events e
JOIN   events x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id
ORDER  BY e.event_id;
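
As for the window frame the question asks about: PostgreSQL 11 and later accept an interval offset in a RANGE frame, so the 30-day count can be written with a window function alone. The following is a minimal sketch, not taken from any of the answers above; it assumes PostgreSQL 11 or later and reuses the column name ts from the preceding queries.

```sql
-- Assumes PostgreSQL 11+; older versions reject an interval offset in a RANGE frame.
-- ts follows the queries above; use "timestamp" for the schema shown in the question.
SELECT event_id, user_id, event_type, ts,
       count(*) OVER w AS ct       -- rows in the frame, current row included
FROM   events
WINDOW w AS (PARTITION BY user_id, event_type
             ORDER BY ts
             RANGE BETWEEN interval '30 days' PRECEDING AND CURRENT ROW)
ORDER  BY event_id;
```

For the sample data in the question this should return the counts 1, 2, 3, 4, 3, 3, 4. Rows that share the current row's ts are peers and fall inside the frame, which matches the ts <= e.ts condition used above. Note that count(DISTINCT ...) is not implemented for window functions in PostgreSQL, so duplicate events would have to be removed beforehand, for example in a subquery.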