Snowflake Analytical Query Design
I have a tricky query-design requirement. I have tried different combinations of analytic functions to get my result from the dataset below. My backup plan is to write a stored procedure, but I wanted to check with the expert group before changing direction.
Input dataset:
Output dataset, with the required Group column: whenever the session ID changes, and the same session ID comes back again later, it must get a different group. I tried LEAD/LAG combinations but could not produce the desired output below; one case or another kept breaking.
Thanks!
Basically, you want to use lag() to see when the session ID changes. Then you want a cumulative sum, but only per session ID:
select t.*,
sum(case when prev_session_id = session_id then 0 else 1 end) over (
partition by pol_id, session_id
order by trans_dt
) as grouping
from (select t.*,
lag(session_id) over (partition by pol_id order by trans_dt) as prev_session_id
from t
) t;
This is a tricky variant of the gaps-and-islands problem. The more usual requirement would be for the three pairs of rows to be enumerated as 1, 2, and 3. For that, you would simply remove session_id from the partition by inside the sum().
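The lag() plus conditional cumulative sum idea can be replayed in a small Python sketch (illustrative only, not the author's code; column names follow the question's dataset):

```python
# Replaying: lag(session_id) over (partition by pol_id order by trans_dt),
# then sum(case when prev = sess then 0 else 1 end) per (pol_id, session_id).
rows = [  # (sess_id, pol_id, trans_dt, version_id), already ordered by trans_dt
    (101, 1, "09:30", 1), (101, 1, "09:35", 2),
    (102, 1, "09:37", 3), (102, 1, "09:38", 4),
    (101, 1, "09:39", 5), (101, 1, "09:40", 6),
]

prev = None      # lag(session_id): the previous row's sess_id within the pol_id
running = {}     # cumulative edge sum per (pol_id, sess_id)
grouping = []
for sess_id, pol_id, _, _ in rows:
    edge = 0 if prev == sess_id else 1   # 1 whenever the session id changes
    key = (pol_id, sess_id)
    running[key] = running.get(key, 0) + edge
    grouping.append(running[key])
    prev = sess_id

print(grouping)  # → [1, 1, 1, 1, 2, 2]
```

The second run of session 101 picks up a new edge, so its per-session sum advances to 2, which is exactly the desired group column.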
The SQL language is expressive enough that declarative solutions can be found for complex requirements.
Snowflake recently implemented the SQL:2016 standard clause MATCH_RECOGNIZE, which is designed to solve exactly this kind of case in a very direct way:
Identifying Sequences of Rows That Match a Pattern
In some cases, you might need to identify sequences of table rows that match a pattern. For example, you might need to:
Determine which users followed a specific sequence of pages and actions on your website before opening a support ticket or making a purchase.
Find the stocks with prices that followed a V-shaped or W-shaped recovery over a period of time.
Look for patterns in sensor data that might indicate an upcoming system failure.
Data preparation:
CREATE OR REPLACE TABLE t
AS
SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:30:00'::DATE AS Trans_dt, 1 AS VERSION_ID
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:35:00'::DATE AS Trans_dt, 2
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:37:00'::DATE AS Trans_dt, 3
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:38:00'::DATE AS Trans_dt, 4
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:39:00'::DATE AS Trans_dt, 5
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:40:00'::DATE AS Trans_dt, 6;
Query:
SELECT *
FROM t
MATCH_RECOGNIZE (
PARTITION BY POL_ID
ORDER BY VERSION_ID
MEASURES MATCH_NUMBER() AS group_id
--,CLASSIFIER() as cks
ALL ROWS PER MATCH
PATTERN (a+b*)
DEFINE a as sess_id = FIRST_VALUE(sess_id)
,b AS sess_id != FIRST_VALUE(sess_id)
) mr
ORDER BY POL_ID, VERSION_ID;
Output:
SESS_ID POL_ID TRANS_DT VERSION_ID GROUP_ID
101 1 2021-04-17 1 1
101 1 2021-04-17 2 1
102 1 2021-04-17 3 1
102 1 2021-04-17 4 1
101 1 2021-04-17 5 2
101 1 2021-04-17 6 2
How it works:
- Define a pattern: (a+b*). This is a Perl-style regular expression: a (one or more) followed by b (zero or more).
- Define the pattern components: a (sess_id equal to the first row of the group) and b (sess_id different from the first row of the group).
- Define the measure MATCH_NUMBER(), which returns the sequential number of the match.
- Do this per POL_ID, using VERSION_ID as the ordering column.
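How PATTERN (a+b*) numbers the matches can be traced with a hypothetical Python walk-through (a sketch of the matching semantics, not Snowflake internals):

```python
# 'a' rows share the first sess_id of the current match, 'b' rows differ from
# it; a row that fits neither symbol closes the match and starts a new one.
rows = [101, 101, 102, 102, 101, 101]  # sess_id per row, ordered by VERSION_ID

match_no, first, in_b = 0, None, False
group_ids = []
for sess in rows:
    if first is None:                  # start a new match: this row is 'a'
        match_no += 1
        first, in_b = sess, False
    elif not in_b and sess == first:   # still inside the a+ run
        pass
    elif sess != first:                # b* run: anything unlike the first row
        in_b = True
    else:                              # first sess_id reappears after b rows:
        match_no += 1                  # the pattern ends, a new match begins
        first, in_b = sess, False
    group_ids.append(match_no)         # MATCH_NUMBER() for this row

print(group_ids)  # → [1, 1, 1, 1, 2, 2]
```

When session 101 reappears after the 102 rows it no longer fits b, so the first match closes and MATCH_NUMBER() advances, which is what produces GROUP_ID 2.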
So in the following, the relationship you want between group_id and pol_id is not obvious, so I have ignored it.
The CTE is used just to fake up the data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6)
v(sess_id, pol_id, trans_dt, version_id)
)
Then I would like to write the operations as:
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM data
So r1 and r2 find where sess_id has gaps with respect to trans_dt; then you want the changes of r3, hence lag_r3 with respect to trans_dt. Those changes are the edges you want to count, hence the SUM, which is zero-based, so the +1 gives the value you want.
Now, the above is not valid in Snowflake as written, so it needs to be layered to work:
SELECT
*
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
FROM data
)
)
ORDER BY trans_dt;
Which gives:
SESS_ID POL_ID TRANS_DT VERSION_ID R1 R2 R3 LAG_R3 SESS_EDGE GROUP_ID
101 1 2021-04-17 09:30:00 1 1 1 0 null 0 1
101 1 2021-04-17 09:35:00 2 2 2 0 0 0 1
102 1 2021-04-17 09:37:00 3 3 1 2 null 0 1
102 1 2021-04-17 09:38:00 4 4 2 2 2 0 1
101 1 2021-04-17 09:39:00 5 5 3 2 0 1 2
101 1 2021-04-17 09:40:00 6 6 4 2 2 0 2
From this you can see how it works. It can then be compressed to:
SELECT
sess_id
,pol_id
,trans_dt
,version_id
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,IFF(LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt)- ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) as r3
FROM data
)
)
ORDER BY trans_dt;
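The r1/r2/r3 mechanics can be replayed in a short Python sketch (illustrative only, not Snowflake code; note that a NULL from LAG yields edge = 0, just as the IFF does):

```python
# Replaying: r1 = global row number, r2 = row number per sess_id, r3 = r1 - r2,
# edge = 1 when r3 changes within a sess_id, GROUP_ID = running edge sum + 1.
rows = [101, 101, 102, 102, 101, 101]  # sess_id per row, ordered by trans_dt

per_sess_count = {}   # drives r2 = ROW_NUMBER() per sess_id
last_r3 = {}          # drives LAG(r3) per sess_id
total_edges = 0
group_ids = []
for r1, sess in enumerate(rows, start=1):
    per_sess_count[sess] = per_sess_count.get(sess, 0) + 1
    r2 = per_sess_count[sess]
    r3 = r1 - r2
    lag_r3 = last_r3.get(sess)         # None on the first row of a sess_id
    edge = 1 if lag_r3 is not None and lag_r3 != r3 else 0
    last_r3[sess] = r3
    total_edges += edge
    group_ids.append(total_edges + 1)  # SUM(...) OVER (ORDER BY trans_dt) + 1

print(group_ids)  # → [1, 1, 1, 1, 2, 2]
```

The only edge falls on the fifth row, where session 101's r3 jumps from 0 to 2, matching the SESS_EDGE column in the table above.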
This is much more complex than Gordon's answer, which, rewritten into the same form as mine, becomes:
select *
,sum(edge) over ( partition by pol_id, sess_id order by trans_dt ) as grouping
from (
select *
,lag(sess_id) over (partition by pol_id order by trans_dt) as prev_session_id
,iff(prev_session_id = sess_id, 0, 1) AS edge
from data
)
ORDER BY 2,3;
This is rather clever, as it sums the edges per sess_id.
But if you add extra data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6),
(102, 1, '2021-04-17 09:41:00', 7),
(102, 1, '2021-04-17 09:42:00', 8),
(103, 1, '2021-04-17 09:43:00', 9),
(103, 1, '2021-04-17 09:44:00', 10)
v(sess_id, pol_id, trans_dt, VERSION_ID)
)
Gordon's answer will assign the last two sessions to group 1, whereas mine will assign group 2, as will Lukasz's; which is correct depends on your expected results.
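The divergence can be seen by replaying Gordon's per-session edge sum in Python on the extended data (an illustrative sketch, not the original SQL; a single pol_id is assumed):

```python
# Replaying: sum(edge) over (partition by pol_id, sess_id order by trans_dt)
# on the extended data. Session 103 arrives last, yet its per-session
# cumulative sum starts fresh at its first edge, so its rows get group 1.
rows = [101, 101, 102, 102, 101, 101, 102, 102, 103, 103]  # ordered by trans_dt

prev = None
running = {}   # cumulative edge sum per sess_id (one pol_id in this sketch)
grouping = []
for sess in rows:
    edge = 0 if prev == sess else 1
    running[sess] = running.get(sess, 0) + edge
    grouping.append(running[sess])
    prev = sess

print(grouping)  # → [1, 1, 1, 1, 2, 2, 2, 2, 1, 1]
```

The session 103 rows at the end land in group 1 because the sum is partitioned by sess_id, so group numbers are recycled across different sessions.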
Also, what do you want to happen when pol_id changes? Do you want the group to be a global count, or should the second pol start again at 1?