Gap and Island problem - query not working for all periods
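I have to create a query that finds the gaps and islands between dates. This looks like a standard gaps-and-islands problem. To illustrate my issue, I will use a data sample. The queries are executed in Snowflake.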
CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017' UNION ALL
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
SELECT '8/24/2017', '8/26/2017' UNION ALL
SELECT '8/28/2017', '9/19/2017' UNION ALL
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL
SELECT '10/25/2017','11/3/2017' UNION ALL
SELECT '11/3/2017', '11/15/2017';
This code gives me the sample table.
Then I have the code for finding the gaps and islands:
SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
(
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            TEST
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY
    IslandStartDate
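The result is:
As you can see, the problem occurs for the period 8/28/2017 - 9/19/2017.
This period should not be a separate island, because it should be included in the period 8/23/2017 - 9/23/2017.
Do you know how I can modify my query to get the correct result (so instead of 6 islands I should have 5, because 8/28/2017 - 9/19/2017 should not be an island)? This is only sample data, so I am looking for a general solution, but so far I have not found the right approach.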
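To make the failure concrete, here is a minimal Python sketch (an illustration, not Snowflake SQL) that mimics the LAG comparison from the query above on the sample rows. The function name lag_island_starts is invented for this example. It flags a row as the start of an island whenever the immediately preceding row ends before the current row starts, which is exactly why 8/28/2017 - 9/19/2017 is split off: its predecessor in sort order is the short row 8/24/2017 - 8/26/2017, not the longer row that actually covers it.

```python
from datetime import date

# Sample rows from the question as ISO date pairs (StartDate, EndDate).
SAMPLE = [
    ("2017-08-20", "2017-08-21"), ("2017-08-22", "2017-09-22"),
    ("2017-08-23", "2017-09-23"), ("2017-08-24", "2017-08-26"),
    ("2017-08-28", "2017-09-19"), ("2017-09-23", "2017-09-27"),
    ("2017-09-25", "2017-10-10"), ("2017-10-17", "2017-10-18"),
    ("2017-10-25", "2017-11-03"), ("2017-11-03", "2017-11-15"),
]
rows = [(date.fromisoformat(s), date.fromisoformat(e)) for s, e in SAMPLE]


def lag_island_starts(rows):
    """Return one flag per row (sorted by start, then end): True when the
    row would start a new island under the LAG(EndDate, 1) comparison."""
    flags = []
    prev_end = None  # plays the role of PreviousEndDate (NULL for the first row)
    for start, end in sorted(rows):
        # CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END
        flags.append(prev_end is None or prev_end < start)
        prev_end = end  # only the immediately preceding row is remembered
    return flags


flags = lag_island_starts(rows)
print(sum(flags))  # number of islands the LAG-based query reports: 6
print(flags[4])    # the 8/28 - 9/19 row is flagged as a new island: True
```

The only state carried between rows is the previous row's end date, so any long interval followed by a shorter one is forgotten as soon as the shorter one has been processed.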
You can remove the overlapping records from the original set:
SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd
StartDate | EndDate |
---|---|
2017-08-20 | 2017-08-21 |
2017-08-22 | 2017-09-23 |
2017-08-23 | 2017-10-10 |
2017-10-17 | 2017-10-18 |
2017-10-25 | 2017-11-03 |
2017-11-03 | 2017-11-15 |
In this result set no additional duplicates have appeared, but a larger record set has more potential for much longer chains of contiguous records, so you may need to run this lookup recursively.
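As a rough sketch of what running the lookup repeatedly could look like, the Python below applies the same strict-overlap rule as the CROSS APPLY subquery (lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) until the set of intervals stops changing. It is only an illustration: the function names are invented, the intervals are hypothetical integers rather than the dates from the question, and it assumes every interval has start < end.

```python
def expand_once(intervals):
    """One pass of the lookup: replace each interval by the min start and
    max end of all intervals that strictly overlap it, then deduplicate."""
    result = set()
    for start, end in intervals:
        hits = [(s, e) for s, e in intervals if s < end and e > start]
        # With start < end every interval overlaps itself, so hits is non-empty.
        result.add((min(s for s, _ in hits), max(e for _, e in hits)))
    return result


def expand_until_stable(intervals):
    """Repeat expand_once until a pass changes nothing (a fixed point)."""
    current = set(intervals)
    while True:
        expanded = expand_once(current)
        if expanded == current:
            return sorted(expanded)
        current = expanded


# Hypothetical chain of three overlapping intervals.
chain = [(1, 4), (3, 6), (5, 8)]
print(sorted(expand_once(chain)))   # one pass is not enough: [(1, 6), (1, 8), (3, 8)]
print(expand_until_stable(chain))   # [(1, 8)]
```

A single pass only reaches the direct neighbours of each interval, which is why a chain of three or more overlapping rows needs more than one pass.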
Putting it all together, you get:
SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
(
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
        (
            SELECT MinStart as StartDate, MaxEnd as EndDate
            FROM Test data
            CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
            GROUP BY MinStart, MaxEnd
        ) Normalized
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY
    IslandStartDate
This results in 4 islands, not the 5 you originally expected, because your 2nd and 3rd input rows, together with the 6th and 7th rows, create a single island spanning 8/22 - 10/10!
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
...
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
IslandStartDate | IslandEndDate |
---|---|
2017-08-20 | 2017-08-21 |
2017-08-22 | 2017-10-10 |
2017-10-17 | 2017-10-18 |
2017-10-25 | 2017-11-15 |
You can express the gaps-and-islands logic like this:
select min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
      from (select t.*,
                   max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
            from test t
           ) t
     ) t
group by grp
order by min(startdate);
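Here is a db<>fiddle.
The idea is to look for the maximum end date over all "earlier" rows. That value is used to check whether there is an overlap.
So the innermost subquery computes the previous end date, taken as the running maximum over all preceding rows. The middle subquery takes a cumulative sum over the group starts to assign a group identifier.
The outer query simply aggregates by the group identifier.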
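The same running-maximum idea can be sketched outside SQL. The Python below is only an illustration of the logic, not a replacement for the query: islands_running_max is a hypothetical name, the rows are the sample data from the question, and the sketch assumes EndDate >= StartDate on every row.

```python
from datetime import date

# Sample rows from the question as ISO date pairs (StartDate, EndDate).
SAMPLE = [
    ("2017-08-20", "2017-08-21"), ("2017-08-22", "2017-09-22"),
    ("2017-08-23", "2017-09-23"), ("2017-08-24", "2017-08-26"),
    ("2017-08-28", "2017-09-19"), ("2017-09-23", "2017-09-27"),
    ("2017-09-25", "2017-10-10"), ("2017-10-17", "2017-10-18"),
    ("2017-10-25", "2017-11-03"), ("2017-11-03", "2017-11-15"),
]
rows = [(date.fromisoformat(s), date.fromisoformat(e)) for s, e in SAMPLE]


def islands_running_max(rows):
    """Merge intervals into islands by comparing each start date with the
    maximum end date seen on all earlier rows (ordered by start date)."""
    islands = []
    running_max_end = None  # max(enddate) over all preceding rows
    for start, end in sorted(rows):
        if running_max_end is None or running_max_end < start:
            islands.append([start, end])               # gap found: new island
        else:
            islands[-1][1] = max(islands[-1][1], end)  # overlap: extend island
        if running_max_end is None or end > running_max_end:
            running_max_end = end
    return [(s.isoformat(), e.isoformat()) for s, e in islands]


for island in islands_running_max(rows):
    print(island)
# ('2017-08-20', '2017-08-21')
# ('2017-08-22', '2017-10-10')
# ('2017-10-17', '2017-10-18')
# ('2017-10-25', '2017-11-15')
```

Because the comparison is inclusive (prev_enddate >= startdate), a row that starts on the day the island so far ends, such as 9/23/2017 - 9/27/2017, is merged into that island rather than starting a new one.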