How to identify and aggregate sequences from start and end dates
I'm trying to identify sequences of consecutive dates per person, along with the sum of amount over each sequence. My records table looks like this:
person start_date end_date amount
1 2015-09-10 2015-09-11 500
1 2015-09-11 2015-09-12 100
1 2015-09-13 2015-09-14 200
1 2015-10-05 2015-10-07 2000
2 2015-10-05 2015-10-05 300
2 2015-10-06 2015-10-06 1000
3 2015-04-23 2015-04-23 900
The query result should look like this:
person sequence_start_date sequence_end_date amount
1 2015-09-10 2015-09-14 800
1 2015-10-05 2015-10-07 2000
2 2015-10-05 2015-10-06 1400
3 2015-04-23 2015-04-23 900
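With the query below I can use LAG and LEAD to identify each sequence's start_date and end_date, but I have no way to aggregate amount. I assume the answer will involve some kind of ROW_NUMBER() window function partitioned by sequence; I just can't work out how to make the sequence identifiable to the function.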
SELECT
person
,COALESCE(sequence_start_date, LAG(sequence_start_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_start_date"
,COALESCE(sequence_end_date, LEAD(sequence_end_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_end_date"
FROM
(
SELECT
person
,start_date
,end_date
,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' = start_date
THEN NULL
ELSE start_date
END AS "sequence_start_date"
,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' = end_date
THEN NULL
ELSE end_date
END AS "sequence_end_date"
,amount
FROM records
) sq
Why not:
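-- EXISTING_QUERY must yield one row per sequence (e.g. via SELECT DISTINCT);
-- duplicate sequence rows would multiply the sum.
select a1.person, a1.sequence_start_date, a1.sequence_end_date,
sum(rx.amount)
as amount
from (EXISTING_QUERY) a1
left join records rx
on rx.person = a1.person
-- a1 exposes only the sequence_* columns, so join on those
and rx.start_date >= a1.sequence_start_date
and rx.end_date <= a1.sequence_end_date
group by a1.person, a1.sequence_start_date, a1.sequence_end_date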
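Even your updated (sub)query still isn't quite right for the data you presented, which is inconsistent about whether the start dates of the second and subsequent rows in a sequence should equal their previous rows' end dates or fall one day later. The query could be updated pretty easily to accommodate both, if that's needed.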
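In any case, you cannot use COALESCE as a window function. Aggregate functions can be used as window functions by providing an OVER clause, but ordinary functions cannot. There are nevertheless ways to apply window functions to this task. Here's a way to identify the sequences in your data (as presented):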
SELECT
person
,MAX(sequence_start_date)
OVER (
PARTITION BY person
ORDER BY start_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS "sequence_start_date"
,MIN(sequence_end_date)
OVER (
PARTITION BY person
ORDER BY start_date
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
AS "sequence_end_date"
,amount
FROM
(
SELECT
person
,start_date
,end_date
,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' >= start_date
THEN date '0001-01-01'
ELSE start_date
END AS "sequence_start_date"
,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' <= end_date
THEN NULL
ELSE end_date
END AS "sequence_end_date"
,amount
FROM records
order by person, start_date
) sq_part
ORDER BY person, sequence_start_date
It relies on MAX() and MIN() instead of COALESCE(), and it applies window framing so that each function sees the appropriate range of rows within each partition. Result:
person sequence_start_date sequence_end_date amount
1 September, 10 2015 00:00:00 September, 12 2015 00:00:00 500
1 September, 10 2015 00:00:00 September, 12 2015 00:00:00 100
1 October, 05 2015 00:00:00 October, 07 2015 00:00:00 2000
2 October, 05 2015 00:00:00 October, 06 2015 00:00:00 300
2 October, 05 2015 00:00:00 October, 06 2015 00:00:00 1000
3 April, 23 2015 00:00:00 April, 23 2015 00:00:00 900
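To see why a running MAX() with an explicit frame works where a one-row LAG() lookup falls short, here is a small sketch in PostgreSQL syntax. The flagged CTE and its rows are made up for illustration: they mimic a subquery's output for one person, with NULL marking rows that continue a sequence (as in the question's subquery; the query above uses a sentinel date instead).

```sql
-- Hypothetical input: seq_start is set only on the first row of each sequence.
WITH flagged (person, start_date, seq_start) AS (
    VALUES
        (1, date '2015-09-10', date '2015-09-10'),
        (1, date '2015-09-11', NULL),
        (1, date '2015-09-13', NULL),
        (1, date '2015-10-05', date '2015-10-05')
)
SELECT start_date,
       -- LAG reaches back exactly one row, so the third row stays NULL
       COALESCE(seq_start,
                LAG(seq_start) OVER (PARTITION BY person ORDER BY start_date)
       ) AS via_lag,
       -- MAX ignores NULLs and sees every row up to the current one
       MAX(seq_start) OVER (
           PARTITION BY person
           ORDER BY start_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS via_running_max
FROM flagged
ORDER BY start_date;

-- start_date | via_lag    | via_running_max
-- 2015-09-10 | 2015-09-10 | 2015-09-10
-- 2015-09-11 | 2015-09-10 | 2015-09-10
-- 2015-09-13 | (null)     | 2015-09-10
-- 2015-10-05 | 2015-10-05 | 2015-10-05
```

The running maximum works here because sequence start dates only increase in start_date order, so the latest non-NULL value is also the largest one.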
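Note that this query does not require an exact match between an end date and the subsequent start date; all rows for each person that abut or overlap will be assigned to the same sequence. If (person, start_date) cannot be relied upon to be unique, however, then you probably also need to order the partitions by end date.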
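One caveat that may be worth checking against your data: comparing each row only with its immediate predecessor can split a sequence when a long row fully contains a shorter one. The sketch below is a different technique from the query above, not a restatement of it: it compares each start date with the running maximum of all earlier end dates, then numbers the sequences with a running SUM(). The sample rows and the names marked, numbered, is_new_sequence and sequence_no are assumptions for illustration; PostgreSQL syntax is assumed, with date + 1 meaning "one day later".

```sql
WITH records (person, start_date, end_date, amount) AS (
    VALUES
        (1, date '2015-01-01', date '2015-01-10', 10),  -- long row
        (1, date '2015-01-02', date '2015-01-03', 20),  -- contained in the long row
        (1, date '2015-01-05', date '2015-01-06', 30),  -- also contained
        (1, date '2015-01-20', date '2015-01-21', 40)   -- separate sequence
),
marked AS (
    SELECT r.*,
           -- 0 if this row abuts or overlaps anything seen so far, else 1.
           -- For the first row the frame is empty, MAX is NULL, and the
           -- comparison is not true, so the row starts a new sequence.
           CASE WHEN start_date <= MAX(end_date) OVER (
                         PARTITION BY person
                         ORDER BY start_date, end_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + 1
                THEN 0 ELSE 1
           END AS is_new_sequence
    FROM records r
),
numbered AS (
    SELECT m.*,
           -- running count of sequence starts = sequence number per person
           SUM(is_new_sequence) OVER (
               PARTITION BY person
               ORDER BY start_date, end_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sequence_no
    FROM marked m
)
SELECT person,
       MIN(start_date) AS sequence_start_date,
       MAX(end_date)   AS sequence_end_date,
       SUM(amount)     AS amount
FROM numbered
GROUP BY person, sequence_no
ORDER BY person, sequence_start_date;

-- person | sequence_start_date | sequence_end_date | amount
-- 1      | 2015-01-01          | 2015-01-10        | 60
-- 1      | 2015-01-20          | 2015-01-21        | 40
```

With the previous-row comparison alone, the third sample row would start its own sequence, because it begins more than one day after the second row ends even though the first row still covers it.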
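And now you have a way to identify the sequences: they are characterized by the triple person, sequence_start_date, sequence_end_date. (Actually, you need only the person and one of those dates for identification purposes, but read on.) You can wrap the above query as an inline view of an outer aggregate query to produce your desired result: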
SELECT
person,
sequence_start_date,
sequence_end_date,
SUM(amount) AS "amount"
FROM ( <above query> ) sq
GROUP BY person, sequence_start_date, sequence_end_date
Of course, you need both dates as grouping columns if you're going to select them.
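Assembled end to end, the approach might look like the sketch below, which can be run as a single statement in PostgreSQL. A few things here are assumptions rather than part of the queries above: the CTE names flagged and labeled, the seq_start and seq_end column names, the inline VALUES copy of the question's sample data, the use of date + 1 in place of + interval '1 day' (so the results stay of type date), NULL in place of the sentinel date, and end_date as a tie-breaker in the ordering.

```sql
WITH records (person, start_date, end_date, amount) AS (
    VALUES
        (1, date '2015-09-10', date '2015-09-11',  500),
        (1, date '2015-09-11', date '2015-09-12',  100),
        (1, date '2015-09-13', date '2015-09-14',  200),
        (1, date '2015-10-05', date '2015-10-07', 2000),
        (2, date '2015-10-05', date '2015-10-05',  300),
        (2, date '2015-10-06', date '2015-10-06', 1000),
        (3, date '2015-04-23', date '2015-04-23',  900)
),
flagged AS (
    -- Keep a start date only on rows that open a sequence,
    -- and an end date only on rows that close one.
    SELECT person, start_date, end_date, amount,
           CASE WHEN LAG(end_date) OVER w + 1 >= start_date
                THEN NULL ELSE start_date END AS seq_start,
           CASE WHEN LEAD(start_date) OVER w - 1 <= end_date
                THEN NULL ELSE end_date END AS seq_end
    FROM records
    WINDOW w AS (PARTITION BY person ORDER BY start_date, end_date)
),
labeled AS (
    -- Spread each sequence's boundaries to all of its rows.
    SELECT person, amount,
           MAX(seq_start) OVER (
               PARTITION BY person ORDER BY start_date, end_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS sequence_start_date,
           MIN(seq_end) OVER (
               PARTITION BY person ORDER BY start_date, end_date
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS sequence_end_date
    FROM flagged
)
SELECT person, sequence_start_date, sequence_end_date, SUM(amount) AS amount
FROM labeled
GROUP BY person, sequence_start_date, sequence_end_date
ORDER BY person, sequence_start_date;

-- person | sequence_start_date | sequence_end_date | amount
-- 1      | 2015-09-10          | 2015-09-14        | 800
-- 1      | 2015-10-05          | 2015-10-07        | 2000
-- 2      | 2015-10-05          | 2015-10-06        | 1400
-- 3      | 2015-04-23          | 2015-04-23        | 900
```

On the question's sample rows this reproduces the requested result, including the 800 total for person 1's first sequence.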