Conditional aggregation in a window function: the gaps-and-islands problem
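I am trying to create a timeline of events, some of which are duplicates:
Source data
user_id | question | opinion | last_modified |
---|---|---|---|
21175381 | 13 | 1 | 2019-03-11 |
21175381 | 13 | 1 | 2019-03-12 |
21175381 | 13 | 0 | 2019-03-13 |
21175381 | 13 | 1 | 2019-03-14 |
21175381 | 13 | 0 | 2019-03-16 |
21175381 | 13 | 0 | 2019-03-17 |
21175381 | 13 | 0 | 2019-03-18 |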
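Imagine being asked the same question at random intervals, for example "Do you agree or disagree with a statement?" My goal is to describe the periods of support, that is, how long the response stayed at 1 or at 0:
Target table
user_id | question | opinion | from | until |
---|---|---|---|---|
21175381 | 13 | 1 | 2019-03-11 | 2019-03-13 |
21175381 | 13 | 0 | 2019-03-13 | 2019-03-14 |
21175381 | 13 | 1 | 2019-03-14 | 2019-03-16 |
21175381 | 13 | 0 | 2019-03-16 | NULL |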
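The last "until" entry should be NULL, because that is the most recent measurement and we assume nothing changes after it.
I tried a basic window function with a row offset, but it does not account for the duplicates and just creates multiple entries with incorrect from/until dates: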
Example of consecutive duplicates - unwanted behavior
user_id | question | opinion | from | until |
---|---|---|---|---|
21175381 | 13 | 1 | 2019-03-11 | 2019-03-12 |
21175381 | 13 | 1 | 2019-03-12 | 2019-03-13 |
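The original attempt is not shown, so the following is only a guess at what a basic row-offset query might look like. It is written as a small Python script around the standard-library sqlite3 module so it can be run without a server. The table name tablename is borrowed from the answers, and using SQLite (3.25 or later, for window functions) in place of SQL Server is an assumption.

```python
import sqlite3

# Sample rows from the question: (user_id, question, opinion, last_modified)
ROWS = [
    (21175381, 13, 1, "2019-03-11"),
    (21175381, 13, 1, "2019-03-12"),
    (21175381, 13, 0, "2019-03-13"),
    (21175381, 13, 1, "2019-03-14"),
    (21175381, 13, 0, "2019-03-16"),
    (21175381, 13, 0, "2019-03-17"),
    (21175381, 13, 0, "2019-03-18"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (user_id INTEGER, question INTEGER, "
    "opinion INTEGER, last_modified TEXT)"
)
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", ROWS)

# Hypothetical naive attempt: pair every row with the next row's date,
# without checking whether the opinion actually changed.
naive = conn.execute(
    """
    SELECT user_id, question, opinion,
           last_modified AS [from],
           LEAD(last_modified) OVER (
               PARTITION BY user_id, question ORDER BY last_modified
           ) AS until
    FROM tablename
    ORDER BY last_modified
    """
).fetchall()

for row in naive:
    print(row)
```

Every source row gets its own from/until pair here, which is why the two consecutive opinion = 1 rows come out as two separate periods instead of one.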
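I am building this in ADF, so any ideas on how to make it work through either pure SQL or an ADF transformation are welcome! Thank you very much for your time, effort and knowledge!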
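Use the LAG() and SUM() window functions to create groups within which the value of opinion does not change, then group by user_id, question and that group to get the earliest date of each group in a CTE.
Then do a self-join of the CTE: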
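WITH cte AS (
    -- One row per island: its opinion and the date it started.
    SELECT user_id, question, grp, opinion, MIN(last_modified) AS [from]
    FROM (
        -- grp is a running count of opinion changes, so it is constant within an island.
        SELECT *,
               SUM(CASE WHEN opinion <> prev_opinion THEN 1 ELSE 0 END)
                   OVER (PARTITION BY user_id, question ORDER BY last_modified) AS grp
        FROM (
            -- Defaulting to ~opinion makes the first row of each partition count as a change.
            SELECT *,
                   LAG(opinion, 1, ~opinion)
                       OVER (PARTITION BY user_id, question ORDER BY last_modified) AS prev_opinion
            FROM tablename
        ) t
    ) t
    GROUP BY user_id, question, opinion, grp
)
-- Each island ends where the next one (grp + 1) begins; the last island gets NULL.
SELECT c1.user_id, c1.question, c1.opinion, c1.[from], c2.[from] AS until
FROM cte c1
LEFT JOIN cte c2
    ON c2.user_id = c1.user_id AND c2.question = c1.question AND c2.grp = c1.grp + 1;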
See the demo.
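To make the grouping step easier to follow, here is a minimal sketch that prints the intermediate prev_opinion and grp columns for the sample data. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, which is an assumption made only so the example is self-contained (SQLite 3.25 or later is needed for window functions). It also uses a constant -1 as the LAG default instead of ~opinion, which assumes opinion is only ever 0 or 1.

```python
import sqlite3

# Sample rows from the question: (user_id, question, opinion, last_modified)
ROWS = [
    (21175381, 13, 1, "2019-03-11"),
    (21175381, 13, 1, "2019-03-12"),
    (21175381, 13, 0, "2019-03-13"),
    (21175381, 13, 1, "2019-03-14"),
    (21175381, 13, 0, "2019-03-16"),
    (21175381, 13, 0, "2019-03-17"),
    (21175381, 13, 0, "2019-03-18"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (user_id INTEGER, question INTEGER, "
    "opinion INTEGER, last_modified TEXT)"
)
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", ROWS)

# Inner query: fetch the previous opinion (-1 when there is no previous row).
# Outer query: running count of changes, which numbers the islands.
groups = conn.execute(
    """
    SELECT last_modified, opinion, prev_opinion,
           SUM(CASE WHEN opinion <> prev_opinion THEN 1 ELSE 0 END) OVER (
               PARTITION BY user_id, question ORDER BY last_modified
           ) AS grp
    FROM (
        SELECT *,
               LAG(opinion, 1, -1) OVER (
                   PARTITION BY user_id, question ORDER BY last_modified
               ) AS prev_opinion
        FROM tablename
    ) AS t
    ORDER BY last_modified
    """
).fetchall()

for row in groups:
    print(row)
```

Rows that share a grp value belong to the same island, so taking MIN(last_modified) per grp gives the start of each period.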
Building on @forpas's answer, you can use LEAD to remove the self-join:
SELECT
    user_id,
    question,
    grp,
    opinion,
    MIN(last_modified) AS [from],
    -- LEAD runs after GROUP BY, so it sees one row per island and returns the next island's start.
    LEAD(MIN(last_modified)) OVER (PARTITION BY user_id, question ORDER BY grp) AS [until]
FROM (
    SELECT *,
           SUM(CASE WHEN opinion <> prev_opinion THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id, question ORDER BY last_modified) AS grp
    FROM (
        -- Default of -1 for the first row of each partition, which has no previous opinion.
        SELECT *,
               LAG(opinion, 1, -1)
                   OVER (PARTITION BY user_id, question ORDER BY last_modified) AS prev_opinion
        FROM tablename
    ) t
) t
GROUP BY user_id, question, opinion, grp;
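As a quick check, the same query shape can be run against the sample data. This is a sketch that assumes SQLite (3.25 or later) via Python's sqlite3 module as a stand-in for SQL Server; the grp column is left out of the output so the result lines up with the target table.

```python
import sqlite3

# Sample rows from the question: (user_id, question, opinion, last_modified)
ROWS = [
    (21175381, 13, 1, "2019-03-11"),
    (21175381, 13, 1, "2019-03-12"),
    (21175381, 13, 0, "2019-03-13"),
    (21175381, 13, 1, "2019-03-14"),
    (21175381, 13, 0, "2019-03-16"),
    (21175381, 13, 0, "2019-03-17"),
    (21175381, 13, 0, "2019-03-18"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (user_id INTEGER, question INTEGER, "
    "opinion INTEGER, last_modified TEXT)"
)
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", ROWS)

# Aggregate each island to its start date, then let LEAD pick up the
# start date of the following island as this island's end.
periods = conn.execute(
    """
    SELECT user_id, question, opinion,
           MIN(last_modified) AS [from],
           LEAD(MIN(last_modified)) OVER (
               PARTITION BY user_id, question ORDER BY grp
           ) AS until
    FROM (
        SELECT *,
               SUM(CASE WHEN opinion <> prev_opinion THEN 1 ELSE 0 END) OVER (
                   PARTITION BY user_id, question ORDER BY last_modified
               ) AS grp
        FROM (
            SELECT *,
                   LAG(opinion, 1, -1) OVER (
                       PARTITION BY user_id, question ORDER BY last_modified
                   ) AS prev_opinion
            FROM tablename
        ) AS t
    ) AS t
    GROUP BY user_id, question, opinion, grp
    ORDER BY user_id, question, grp
    """
).fetchall()

for row in periods:
    print(row)
```

The last period comes back with None for until, which corresponds to the NULL the question asks for.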
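I think the simplest way to approach this is to use lag() to keep only the first record whenever there is a change. No aggregation or join is needed.
Then use lead() to get the next value: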
select user_id, question, opinion,
       last_modified as [from],  -- from is a reserved word, so it must be quoted
       lead(last_modified) over (partition by user_id, question order by last_modified) as until
from (select t.*,
             lag(opinion) over (partition by user_id, question order by last_modified) as prev_opinion
      from t
     ) t
-- keep only the first row of each run; lead() above then sees just those rows
where prev_opinion <> opinion or prev_opinion is null;
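I think this is the simplest way to get the result you want. It should also have the best performance, since it needs neither an aggregation nor a join.
Here is a db<>fiddle.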
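For anyone who wants to try this locally, here is a minimal sketch of the same lag()/lead() approach on SQLite (3.25 or later) through Python's sqlite3 module, which is an assumption standing in for SQL Server. The rows for user 99 are hypothetical and are added only to show that each partition gets its own NULL at the end.

```python
import sqlite3

# Sample rows from the question, plus a hypothetical second user (99).
ROWS = [
    (21175381, 13, 1, "2019-03-11"),
    (21175381, 13, 1, "2019-03-12"),
    (21175381, 13, 0, "2019-03-13"),
    (21175381, 13, 1, "2019-03-14"),
    (21175381, 13, 0, "2019-03-16"),
    (21175381, 13, 0, "2019-03-17"),
    (21175381, 13, 0, "2019-03-18"),
    (99, 13, 0, "2019-03-12"),
    (99, 13, 0, "2019-03-15"),
    (99, 13, 1, "2019-03-17"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (user_id INTEGER, question INTEGER, "
    "opinion INTEGER, last_modified TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", ROWS)

# The WHERE clause is applied before the outer LEAD is evaluated, so LEAD
# only ever sees the first row of each run of identical opinions.
timeline = conn.execute(
    """
    SELECT user_id, question, opinion,
           last_modified AS [from],
           LEAD(last_modified) OVER (
               PARTITION BY user_id, question ORDER BY last_modified
           ) AS until
    FROM (
        SELECT t.*,
               LAG(opinion) OVER (
                   PARTITION BY user_id, question ORDER BY last_modified
               ) AS prev_opinion
        FROM t
    ) AS t
    WHERE prev_opinion <> opinion OR prev_opinion IS NULL
    ORDER BY user_id, question, last_modified
    """
).fetchall()

for row in timeline:
    print(row)
```

Note that the filter relies on opinion never being NULL in the data; a NULL opinion would make the comparison evaluate to NULL and would need separate handling.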