How to show events from the same user that are 2 minutes or less apart?
Suppose I have a table containing events:
 id | user | date                | amount
----+------+---------------------+--------
  1 |    1 | 29.10.2019 16:35:01 |     10
  2 |    1 | 29.10.2019 16:35:29 |     15
  3 |    2 | 29.10.2019 16:48:29 |     12
  4 |    2 | 29.10.2019 16:55:44 |     14
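I would like to see all events, ordered by user and date, that are like events 1 and 2: less than 2 minutes apart and from the same user.
What I have tried so far: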
SELECT *
FROM (
SELECT id, user, amount,
datediff(MINUTE, lag(date) OVER (ORDER BY user, d_date), date)
AS since_past_one
FROM events
) e
where since_past_one <> 0
and since_past_one <= 2
order by user, date
The problem is that lag() returns a value even when the previous row belongs to a different user.
The result I would like to see is:
 id | user | date                | amount
----+------+---------------------+--------
  1 |    1 | 29.10.2019 16:35:01 |     10
  2 |    1 | 29.10.2019 16:35:29 |     15
I think you are just missing the PARTITION BY clause:
datediff(MINUTE, lag(date) OVER (PARTITION BY user ORDER BY d_date), date)
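As a minimal sketch of what PARTITION BY changes, the question's four sample rows can be loaded into an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is an assumption made only so the example runs anywhere; it is not Vertica. The column names usr and ts, the ISO date strings, and the 120-second threshold are stand-ins for the question's user, date, and 2 minutes. Because the desired output contains both event 1 and event 2, the sketch also looks forward with LEAD, which goes a step beyond the one-line fix above.

```python
import sqlite3

# The question's sample rows, with dates rewritten as ISO strings for SQLite.
ROWS = [
    (1, 1, "2019-10-29 16:35:01", 10),
    (2, 1, "2019-10-29 16:35:29", 15),
    (3, 2, "2019-10-29 16:48:29", 12),
    (4, 2, "2019-10-29 16:55:44", 14),
]

QUERY = """
SELECT id, usr, ts, amount
FROM (
    SELECT id, usr, ts, amount,
           -- seconds since the same user's previous event (NULL for the first)
           CAST(strftime('%s', ts) AS INTEGER)
             - CAST(strftime('%s', LAG(ts) OVER (PARTITION BY usr ORDER BY ts)) AS INTEGER)
             AS since_prev,
           -- seconds until the same user's next event (NULL for the last)
           CAST(strftime('%s', LEAD(ts) OVER (PARTITION BY usr ORDER BY ts)) AS INTEGER)
             - CAST(strftime('%s', ts) AS INTEGER)
             AS until_next
    FROM events
) e
WHERE since_prev <= 120 OR until_next <= 120
ORDER BY usr, ts
"""


def close_events(rows=ROWS):
    """Return events that have a same-user neighbour within 120 seconds."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER, usr INTEGER, ts TEXT, amount INTEGER)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in close_events():
        print(row)
```

With the partition in place, the first row of each user gets NULL from LAG instead of a value borrowed from the previous user, so events 3 and 4 (more than 7 minutes apart) are not reported.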
Then let's have some more rows, in a pattern like this, where gaps of more than 2 seconds separate the groups of rows:
WITH input(id,usr,ts,amount) AS (
SELECT 1,1,TIMESTAMP '2019-10-29 16:35:00.0',10.00
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:01.5',10.26
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:03.0',10.52
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:04.5',10.78
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:06.0',11.03
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:07.5',11.29
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:09.0',11.55
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:10.5',11.81
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:18.0',13.10
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:19.5',13.36
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:21.0',13.62
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:27.0',14.66
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:28.5',14.91
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:00.0',12.00
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:01.5',12.10
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:03.0',12.21
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:04.5',12.31
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:12.0',12.83
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:13.5',12.93
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:15.0',13.03
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:16.5',13.14
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:24.0',13.66
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:25.5',13.76
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:27.0',13.86
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:28.5',13.97
)
SELECT * FROM input;
This gives me:
id | usr | ts | amount
----+-----+-----------------------+--------
1 | 1 | 2019-10-29 16:35:00 | 10.00
1 | 1 | 2019-10-29 16:35:01.5 | 10.26
1 | 1 | 2019-10-29 16:35:03 | 10.52
1 | 1 | 2019-10-29 16:35:04.5 | 10.78
1 | 1 | 2019-10-29 16:35:06 | 11.03
1 | 1 | 2019-10-29 16:35:07.5 | 11.29
1 | 1 | 2019-10-29 16:35:09 | 11.55
1 | 1 | 2019-10-29 16:35:10.5 | 11.81
1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 2019-10-29 16:35:19.5 | 13.36
1 | 1 | 2019-10-29 16:35:21 | 13.62
1 | 1 | 2019-10-29 16:35:27 | 14.66
1 | 1 | 2019-10-29 16:35:28.5 | 14.91
3 | 2 | 2019-10-29 16:35:00 | 12.00
3 | 2 | 2019-10-29 16:35:01.5 | 12.10
3 | 2 | 2019-10-29 16:35:03 | 12.21
3 | 2 | 2019-10-29 16:35:04.5 | 12.31
3 | 2 | 2019-10-29 16:35:12 | 12.83
3 | 2 | 2019-10-29 16:35:13.5 | 12.93
3 | 2 | 2019-10-29 16:35:15 | 13.03
3 | 2 | 2019-10-29 16:35:16.5 | 13.14
3 | 2 | 2019-10-29 16:35:24 | 13.66
3 | 2 | 2019-10-29 16:35:25.5 | 13.76
3 | 2 | 2019-10-29 16:35:27 | 13.86
3 | 2 | 2019-10-29 16:35:28.5 | 13.97
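So, for the same user, we want the first row of each group of rows in which consecutive rows are never more than 2 seconds apart. Vertica can identify such groups. The process is generally called "sessionization". The Vertica OLAP function CONDITIONAL_TRUE_EVENT() can do this for us: it starts at 0 in each PARTITION and increments by 1 every time the Boolean expression in the parentheses is true.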
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
This gives us:
session_id | id | usr | ts | amount
------------+----+-----+-----------------------+--------
0 | 1 | 1 | 2019-10-29 16:35:00 | 10.00
0 | 1 | 1 | 2019-10-29 16:35:01.5 | 10.26
0 | 1 | 1 | 2019-10-29 16:35:03 | 10.52
0 | 1 | 1 | 2019-10-29 16:35:04.5 | 10.78
0 | 1 | 1 | 2019-10-29 16:35:06 | 11.03
0 | 1 | 1 | 2019-10-29 16:35:07.5 | 11.29
0 | 1 | 1 | 2019-10-29 16:35:09 | 11.55
0 | 1 | 1 | 2019-10-29 16:35:10.5 | 11.81
1 | 1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 1 | 2019-10-29 16:35:19.5 | 13.36
1 | 1 | 1 | 2019-10-29 16:35:21 | 13.62
2 | 1 | 1 | 2019-10-29 16:35:27 | 14.66
2 | 1 | 1 | 2019-10-29 16:35:28.5 | 14.91
0 | 3 | 2 | 2019-10-29 16:35:00 | 12.00
0 | 3 | 2 | 2019-10-29 16:35:01.5 | 12.10
0 | 3 | 2 | 2019-10-29 16:35:03 | 12.21
0 | 3 | 2 | 2019-10-29 16:35:04.5 | 12.31
1 | 3 | 2 | 2019-10-29 16:35:12 | 12.83
1 | 3 | 2 | 2019-10-29 16:35:13.5 | 12.93
1 | 3 | 2 | 2019-10-29 16:35:15 | 13.03
1 | 3 | 2 | 2019-10-29 16:35:16.5 | 13.14
2 | 3 | 2 | 2019-10-29 16:35:24 | 13.66
2 | 3 | 2 | 2019-10-29 16:35:25.5 | 13.76
2 | 3 | 2 | 2019-10-29 16:35:27 | 13.86
2 | 3 | 2 | 2019-10-29 16:35:28.5 | 13.97
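The counting rule applied above can also be sketched outside the database. The Python function below is a hypothetical emulation, not Vertica code: the name sessionize, its parameters, and the (usr, ts) tuple layout are assumptions chosen for illustration. It restarts the counter at 0 for each user and increments it whenever the gap to the previous row of the same user exceeds 2 seconds.

```python
from datetime import datetime, timedelta
from itertools import groupby


def sessionize(rows, max_gap=timedelta(seconds=2)):
    """Assign a per-user session id to (usr, ts) tuples.

    Mirrors the behaviour described for CONDITIONAL_TRUE_EVENT(): the counter
    starts at 0 in each partition (user) and grows by 1 each time the gap to
    the previous timestamp is larger than max_gap.
    """
    out = []
    ordered = sorted(rows)  # sort by usr, then ts (PARTITION BY usr ORDER BY ts)
    for usr, group in groupby(ordered, key=lambda r: r[0]):
        session_id = 0
        prev_ts = None  # no previous row yet, like LAG() returning NULL
        for _, ts in group:
            if prev_ts is not None and ts - prev_ts > max_gap:
                session_id += 1
            out.append((session_id, usr, ts))
            prev_ts = ts
    return out


def parse(text):
    """Parse a timestamp such as '2019-10-29 16:35:10.5'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


if __name__ == "__main__":
    sample = [
        (1, parse("2019-10-29 16:35:09.0")),
        (1, parse("2019-10-29 16:35:10.5")),
        (1, parse("2019-10-29 16:35:18.0")),
        (2, parse("2019-10-29 16:35:04.5")),
        (2, parse("2019-10-29 16:35:12.0")),
    ]
    for session_id, usr, ts in sessionize(sample):
        print(session_id, usr, ts)
```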
And, to get the first row of each group, we use the Vertica-specific analytic LIMIT clause:
WITH
with_sess_id AS (
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
)
SELECT
id
, usr
, ts
, amount
FROM with_sess_id
LIMIT 1 OVER(PARTITION BY usr,session_id ORDER BY ts);
You get:
id | usr | ts | amount
----+-----+---------------------+--------
1 | 1 | 2019-10-29 16:35:00 | 10.00
1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 2019-10-29 16:35:27 | 14.66
3 | 2 | 2019-10-29 16:35:00 | 12.00
3 | 2 | 2019-10-29 16:35:12 | 12.83
3 | 2 | 2019-10-29 16:35:24 | 13.66
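For engines that have neither CONDITIONAL_TRUE_EVENT() nor the analytic LIMIT clause, the same result can probably be approximated with standard window functions: a running SUM over a "gap is too large" flag yields the session id, and ROW_NUMBER() = 1 picks the first row of each session. The sketch below runs that SQL in SQLite (3.25 or later) from Python on a subset of the sample rows. The table name input_rows, the column gap_sec, and the helper first_rows are assumptions for illustration, and SQLite's julianday() stands in for Vertica's timestamp arithmetic.

```python
import sqlite3

# A subset of the sample rows: (usr, ts, amount).
ROWS = [
    (1, "2019-10-29 16:35:09.0", 11.55),
    (1, "2019-10-29 16:35:10.5", 11.81),
    (1, "2019-10-29 16:35:18.0", 13.10),
    (1, "2019-10-29 16:35:19.5", 13.36),
    (1, "2019-10-29 16:35:27.0", 14.66),
    (2, "2019-10-29 16:35:03.0", 12.21),
    (2, "2019-10-29 16:35:04.5", 12.31),
    (2, "2019-10-29 16:35:12.0", 12.83),
]

QUERY = """
WITH gaps AS (
    -- seconds since the same user's previous row (NULL for the first row)
    SELECT usr, ts, amount,
           (julianday(ts)
             - julianday(LAG(ts) OVER (PARTITION BY usr ORDER BY ts))) * 86400.0
             AS gap_sec
    FROM input_rows
),
with_sess_id AS (
    -- running count of gaps larger than 2 seconds = session id, starting at 0
    SELECT usr, ts, amount,
           SUM(CASE WHEN gap_sec > 2.0 THEN 1 ELSE 0 END)
               OVER (PARTITION BY usr ORDER BY ts) AS session_id
    FROM gaps
),
ranked AS (
    SELECT usr, ts, amount,
           ROW_NUMBER() OVER (PARTITION BY usr, session_id ORDER BY ts) AS rn
    FROM with_sess_id
)
SELECT usr, ts, amount
FROM ranked
WHERE rn = 1
ORDER BY usr, ts
"""


def first_rows(rows=ROWS):
    """Return the first row of every per-user session."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE input_rows (usr INTEGER, ts TEXT, amount REAL)")
    con.executemany("INSERT INTO input_rows VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in first_rows():
        print(row)
```

The two steps are split into separate CTEs because a window function cannot be nested directly inside another window function's argument.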
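If I understood your last question correctly, you want the average number of rows per session, and the average amount per session, for the sessions we identified above. If so, that would be: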
WITH
with_sess_id AS (
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
)
,
session_summary AS (
SELECT
usr
, session_id
, COUNT(*) AS rows_per_session
, AVG(amount) AS avg_amt_per_session
FROM with_sess_id
GROUP BY 1,2
-- this returns:
-- usr | session_id | rows_per_session | avg_amt_per_session
-- -----+------------+------------------+---------------------
-- 1 | 0 | 8 | 10.905
-- 1 | 1 | 3 | 13.36
-- 1 | 2 | 2 | 14.785
-- 2 | 0 | 4 | 12.155
-- 2 | 1 | 4 | 12.9825
-- 2 | 2 | 4 | 13.8125
)
SELECT
AVG(rows_per_session) AS avg_rows_per_session
, AVG(avg_amt_per_session) AS avg_avg_amount_per_session
FROM session_summary;
avg_rows_per_session | avg_avg_amount_per_session
----------------------+----------------------------
4.16666666666667 | 13
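The figures above can be checked by hand. The following Python sketch recomputes them from the sample amounts with exact fractions; the dictionary layout and variable names are assumptions for illustration. It also computes the plain average over all 25 rows, to show that an average of per-session averages weights every session equally regardless of its size, which may or may not be what the question intends.

```python
from fractions import Fraction

# Amounts per (usr, session_id), copied from the sessionized sample rows.
SESSIONS = {
    (1, 0): ["10.00", "10.26", "10.52", "10.78", "11.03", "11.29", "11.55", "11.81"],
    (1, 1): ["13.10", "13.36", "13.62"],
    (1, 2): ["14.66", "14.91"],
    (2, 0): ["12.00", "12.10", "12.21", "12.31"],
    (2, 1): ["12.83", "12.93", "13.03", "13.14"],
    (2, 2): ["13.66", "13.76", "13.86", "13.97"],
}

# Exact per-session averages, e.g. (1, 0) -> 10.905.
session_avgs = {
    key: sum(Fraction(a) for a in amounts) / len(amounts)
    for key, amounts in SESSIONS.items()
}

# Average number of rows per session: 25 rows over 6 sessions.
avg_rows_per_session = Fraction(sum(len(a) for a in SESSIONS.values()), len(SESSIONS))

# Average of the per-session averages: every session counts once.
avg_of_avgs = sum(session_avgs.values()) / len(session_avgs)

# Plain average over all rows: every row counts once.
all_amounts = [Fraction(a) for amounts in SESSIONS.values() for a in amounts]
overall_avg = sum(all_amounts) / len(all_amounts)

if __name__ == "__main__":
    print(float(avg_rows_per_session))  # about 4.1667
    print(float(avg_of_avgs))           # 13.0
    print(float(overall_avg))           # 12.5076
```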