Select records in SQL according to count of events in a time interval
First, an example of my table:
+---------+-----------------+------------+-----------------+---------------------+
| user_id | email | home_phone | incoming_number | date_time |
+---------+-----------------+------------+-----------------+---------------------+
| 1 | dan@dan.com | 8893432 | 5453455 | 2018-03-27 13:48:10 |
| 1 | dan@dan.com | 8893432 | 65765489 | 2018-03-27 13:47:10 |
| 1 | dan@dan.com | 8893432 | 65765489 | 2018-03-27 13:48:05 |
| 2 | sam@sam.com | 16568675 | 65658403 | 2018-03-27 13:46:05 |
| 2 | sam@sam.com | 16568675 | 57575748 | 2018-03-27 13:32:05 |
| 2 | sam@sam.com | 16568675 | 76547946 | 2018-03-27 13:43:05 |
| 3 | allen@allen.com | 12345678 | 85768576 | 2018-03-27 13:46:05 |
| 3 | allen@allen.com | 12345678 | 65658403 | 2018-03-27 13:42:05 |
| 3 | allen@allen.com | 12345678 | 76547946 | 2018-03-27 13:43:05 |
| 3 | allen@allen.com | 12345678 | 76547946 | 2018-03-27 13:20:05 |
+---------+-----------------+------------+-----------------+---------------------+
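What am I trying to accomplish?
I want to select all triplets (user_id, email, home_phone) that have at least 3 distinct incoming_number values within a 10-minute time frame.
For example, in the table above the result would be only (3, allen@allen.com, 12345678). The first user has only two distinct incoming_number values, and the second user's time frame is > 10 minutes.
Notes:
An incoming number can appear multiple times, but with different date_time values.
Each user_id has exactly one email and exactly one home_phone.
What have I tried so far?
I thought maybe I should treat the first 3 columns as one key? Maybe count distinct on incoming_number and work it out from there somehow? I don't have many ideas.
What SQL query would solve my problem?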
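If I understand correctly, none of the groups satisfies both conditions: 3 distinct incoming_number-s and a duration of less than 10 minutes between the first and the last call. So, for illustration purposes, I added a group, with the email match@match.com, that satisfies both. The query below contains your data in a WITH clause, and the final report shows all the intermediate results that the conditions are built from. Remove the HAVING clause to inspect those results for the rows that do not qualify.
Happy playing,
Marco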
WITH
input( user_id,email ,home_phone,incoming_number,date_time) AS (
SELECT 1,'dan@dan.com' , 8893432 , 5453455 ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,65658403 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,57575748 ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,85768576 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,65658403 ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:20:05'
UNION ALL SELECT 4,'match@match.com',62345677 ,85768576 ,TIMESTAMP '2018-03-27 13:11:05'
UNION ALL SELECT 4,'match@match.com',62345677 ,65658403 ,TIMESTAMP '2018-03-27 13:13:05'
UNION ALL SELECT 4,'match@match.com',62345677 ,76547946 ,TIMESTAMP '2018-03-27 13:18:05'
UNION ALL SELECT 4,'match@match.com',62345677 ,76547946 ,TIMESTAMP '2018-03-27 13:20:05'
)
SELECT
user_id
, email
, home_phone
, MAX(date_time) - MIN(date_time) duration
, MAX(date_time) end_ts
, MIN(date_time) start_ts
, COUNT(DISTINCT incoming_number) incoming_number_count
FROM input
GROUP BY
user_id
, email
, home_phone
HAVING MAX(date_time) - MIN(date_time) < INTERVAL '10 minutes'
AND COUNT(DISTINCT incoming_number) >=3
;
user_id|email          |home_phone|duration         |end_ts             |start_ts           |incoming_number_count
      4|match@match.com|62,345,677|0 00:09:00.000000|2018-03-27 13:20:05|2018-03-27 13:11:05|                    3
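Second answer - now that I see what you want, but keeping the original answer:
In the situation you describe, we need to go the OLAP route.
We subtract the date_time from two rows back from the current date_time (using LAG()), and since Vertica does not support COUNT(DISTINCT col) OVER(), we use Vertica's specific CONDITIONAL_CHANGE_EVENT() OLAP function to count how often incoming_number changes. It returns 0 if the value never changes and 1 or 2 if it changes once or twice, so two changes mean 3 distinct incoming_number-s: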
WITH
input( user_id,email ,home_phone,incoming_number,date_time) AS (
SELECT 1,'dan@dan.com' , 8893432 , 5453455 ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,65658403 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,57575748 ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,85768576 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,65658403 ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:20:05'
)
,
w_filter_val AS (
SELECT
*
, date_time - LAG(date_time,2) OVER(PARTITION BY user_id ORDER BY date_time) AS time4these3
, CONDITIONAL_CHANGE_EVENT(incoming_number) OVER(PARTITION BY user_id ORDER BY incoming_number) AS count_in_nbr_minus1
FROM input
)
SELECT * FROM w_filter_val ORDER BY 1;
user_id | email | home_phone | incoming_number | date_time | time4these3 | count_in_nbr_minus1
---------+-----------------+------------+-----------------+---------------------+-------------+---------------------
1 | dan@dan.com | 8893432 | 5453455 | 2018-03-27 13:48:10 | 00:01 | 0
1 | dan@dan.com | 8893432 | 65765489 | 2018-03-27 13:47:10 | | 1
1 | dan@dan.com | 8893432 | 65765489 | 2018-03-27 13:48:05 | | 1
2 | sam@sam.com | 16568675 | 57575748 | 2018-03-27 13:32:05 | | 0
2 | sam@sam.com | 16568675 | 65658403 | 2018-03-27 13:46:05 | 00:14 | 1
2 | sam@sam.com | 16568675 | 76547946 | 2018-03-27 13:43:05 | | 2
3 | allen@allen.com | 12345678 | 65658403 | 2018-03-27 13:42:05 | | 0
3 | allen@allen.com | 12345678 | 76547946 | 2018-03-27 13:20:05 | | 1
3 | allen@allen.com | 12345678 | 76547946 | 2018-03-27 13:43:05 | 00:23 | 1
3 | allen@allen.com | 12345678 | 85768576 | 2018-03-27 13:46:05 | 00:04 | 2
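To see where the time4these3 and count_in_nbr_minus1 values above come from, here is a minimal sketch in plain Python rather than Vertica. The helper names span_of_three and change_counter are made up for this illustration, and the change counter only imitates what CONDITIONAL_CHANGE_EVENT() does in the query above (a running count of value changes within each user, ordered by incoming_number); it is not Vertica's implementation.

```python
from datetime import datetime
from itertools import groupby

# (user_id, incoming_number, date_time) taken from the question's table
calls = [
    (1, 5453455, "2018-03-27 13:48:10"),
    (1, 65765489, "2018-03-27 13:47:10"),
    (1, 65765489, "2018-03-27 13:48:05"),
    (2, 65658403, "2018-03-27 13:46:05"),
    (2, 57575748, "2018-03-27 13:32:05"),
    (2, 76547946, "2018-03-27 13:43:05"),
    (3, 85768576, "2018-03-27 13:46:05"),
    (3, 65658403, "2018-03-27 13:42:05"),
    (3, 76547946, "2018-03-27 13:43:05"),
    (3, 76547946, "2018-03-27 13:20:05"),
]


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def span_of_three(calls):
    """Imitates date_time - LAG(date_time, 2) per user, ordered by date_time."""
    result = {}
    ordered = sorted(calls, key=lambda c: (c[0], parse(c[2])))
    for user_id, group in groupby(ordered, key=lambda c: c[0]):
        times = [parse(c[2]) for c in group]
        # Rows 0 and 1 of each user have no row two steps back (NULL in SQL).
        for i in range(2, len(times)):
            result[(user_id, times[i])] = times[i] - times[i - 2]
    return result


def change_counter(calls):
    """Running count of incoming_number changes per user, ordered by number."""
    result = {}
    ordered = sorted(calls, key=lambda c: (c[0], c[1]))
    for user_id, group in groupby(ordered, key=lambda c: c[0]):
        changes, previous = 0, None
        for _, number, ts in group:
            if previous is not None and number != previous:
                changes += 1
            previous = number
            result[(user_id, number, ts)] = changes
    return result


spans = span_of_three(calls)
counts = change_counter(calls)
print(spans[(3, parse("2018-03-27 13:46:05"))])       # span of user 3's last three calls
print(counts[(3, 85768576, "2018-03-27 13:46:05")])  # changes seen so far for user 3
```

Note that the span is taken over three consecutive calls in time order, while the change counter runs over each user's whole partition in number order.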
Finally, all we need to do is filter for a duration of 10 minutes or less and 3 or more incoming_number-s:
WITH
input( user_id,email ,home_phone,incoming_number,date_time) AS (
SELECT 1,'dan@dan.com' , 8893432 , 5453455 ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT 1,'dan@dan.com' , 8893432 ,65765489 ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,65658403 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,57575748 ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT 2,'sam@sam.com' ,16568675 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,85768576 ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,65658403 ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678 ,76547946 ,TIMESTAMP '2018-03-27 13:20:05'
)
,
w_filter_val AS (
SELECT
*
, date_time - LAG(date_time,2) OVER(PARTITION BY user_id ORDER BY date_time) AS time4these3
, CONDITIONAL_CHANGE_EVENT(incoming_number) OVER(PARTITION BY user_id ORDER BY incoming_number) AS count_in_nbr_minus1
FROM input
)
SELECT * FROM w_filter_val WHERE time4these3 <= '10 MINUTES' AND count_in_nbr_minus1 + 1 >= 3
;
user_id | email | home_phone | incoming_number | date_time | time4these3 | count_in_nbr_minus1
---------+-----------------+------------+-----------------+---------------------+-------------+---------------------
3 | allen@allen.com | 12345678 | 85768576 | 2018-03-27 13:46:05 | 00:04 | 2
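In the query above, the distinct-number count is taken over each user's whole partition, while the time span covers three consecutive calls. A more literal reading of the question would count distinct numbers inside every 10-minute window. The sketch below is one possible way to do that with a self-join. It is not part of the answer above and has not been run on Vertica: it uses SQLite through Python's sqlite3 module so that it is self-contained, the table name calls is an assumption, and datetime(..., '+10 minutes') is SQLite syntax (on Vertica the equivalent would presumably be date_time + INTERVAL '10 minutes').

```python
import sqlite3

# Rows from the question's table.
rows = [
    (1, "dan@dan.com", 8893432, 5453455, "2018-03-27 13:48:10"),
    (1, "dan@dan.com", 8893432, 65765489, "2018-03-27 13:47:10"),
    (1, "dan@dan.com", 8893432, 65765489, "2018-03-27 13:48:05"),
    (2, "sam@sam.com", 16568675, 65658403, "2018-03-27 13:46:05"),
    (2, "sam@sam.com", 16568675, 57575748, "2018-03-27 13:32:05"),
    (2, "sam@sam.com", 16568675, 76547946, "2018-03-27 13:43:05"),
    (3, "allen@allen.com", 12345678, 85768576, "2018-03-27 13:46:05"),
    (3, "allen@allen.com", 12345678, 65658403, "2018-03-27 13:42:05"),
    (3, "allen@allen.com", 12345678, 76547946, "2018-03-27 13:43:05"),
    (3, "allen@allen.com", 12345678, 76547946, "2018-03-27 13:20:05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (user_id INTEGER, email TEXT, home_phone INTEGER,"
    " incoming_number INTEGER, date_time TEXT)"
)
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", rows)

# Every call 'a' opens a window [a.date_time, a.date_time + 10 minutes].
# The self-join collects the same user's calls 'b' inside that window, and
# HAVING keeps the windows that contain at least 3 distinct numbers.
QUERY = """
SELECT DISTINCT a.user_id, a.email, a.home_phone
FROM calls a
JOIN calls b
  ON  b.user_id   = a.user_id
  AND b.date_time >= a.date_time
  AND b.date_time <= datetime(a.date_time, '+10 minutes')
GROUP BY a.user_id, a.email, a.home_phone, a.date_time
HAVING COUNT(DISTINCT b.incoming_number) >= 3
ORDER BY a.user_id
"""

matches = conn.execute(QUERY).fetchall()
print(matches)
```

The self-join compares every call with every other call of the same user, so on a large table it may be worth restricting the date range first.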