Gap and Island problem - query not working for all periods

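I have to create a query that finds the gaps and islands between dates. It looks like a standard gaps-and-islands problem. To illustrate my issue I will use a data sample. The queries are executed in Snowflake.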

CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017'  UNION ALL
SELECT '8/22/2017', '9/22/2017'  UNION ALL
SELECT '8/23/2017', '9/23/2017'  UNION ALL 
SELECT '8/24/2017', '8/26/2017'  UNION ALL 
SELECT '8/28/2017', '9/19/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017';

This code gives me a sample table.

Then I have this code to find the gaps and islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            TEST
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

The result is:

As you can see, the problem occurs with the period 8/28/2017 - 9/19/2017. This period should not be a separate island, because it is fully contained in the period 8/23/2017 - 9/23/2017.

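Do you know how I can modify my query to get the correct result (instead of 6 islands I should have 5, because 8/28/2017 - 9/19/2017 should not be an island)? This is only sample data, so I am looking for a general solution, but so far I have not found the right approach.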

You can remove the overlapping records from the original set:

SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd

StartDate   EndDate
2017-08-20  2017-08-21
2017-08-22  2017-09-23
2017-08-23  2017-10-10
2017-10-17  2017-10-18
2017-10-25  2017-11-03
2017-11-03  2017-11-15

In this result set no further overlaps remain, but in a larger record set a chain of contiguous records could span a much wider range, which means you may need to apply this lookup recursively.

Putting it all together, you get:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
        (
            SELECT MinStart as StartDate, MaxEnd as EndDate
            FROM Test data
            CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
            GROUP BY MinStart, MaxEnd       
        ) Normalized
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

This results in 4 islands, not the 5 you originally expected, because your 2nd and 3rd input rows together with your 6th and 7th rows create a single island spanning 8/22 - 10/10!

SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL 
...
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL

IslandStartDate  IslandEndDate
2017-08-20       2017-08-21
2017-08-22       2017-10-10
2017-10-17       2017-10-18
2017-10-25       2017-11-15

You can express the gaps-and-islands logic like this:

select min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
      from (select t.*,
                   max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
            from test t
           ) t
     ) t
group by grp
order by min(startdate);

Here is a db<>fiddle.

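The idea is to look for the maximum end date over all "earlier" rows, not just the immediately preceding one. That value is then used to check whether the current row overlaps anything before it.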
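To see why the running maximum matters, here is a small sketch in plain Python (used only for illustration, it is not part of the query) that computes both values for the sample rows: what LAG(EndDate) sees versus what the running MAX(EndDate) sees. The names `lag_end`, `running_max_end` and `differs` are made up for this example.

```python
from datetime import date
from itertools import accumulate

# Sample rows from the question, sorted like ORDER BY StartDate, EndDate.
periods = sorted([
    (date(2017, 8, 20), date(2017, 8, 21)),
    (date(2017, 8, 22), date(2017, 9, 22)),
    (date(2017, 8, 23), date(2017, 9, 23)),
    (date(2017, 8, 24), date(2017, 8, 26)),
    (date(2017, 8, 28), date(2017, 9, 19)),
    (date(2017, 9, 23), date(2017, 9, 27)),
    (date(2017, 9, 25), date(2017, 10, 10)),
    (date(2017, 10, 17), date(2017, 10, 18)),
    (date(2017, 10, 25), date(2017, 11, 3)),
    (date(2017, 11, 3), date(2017, 11, 15)),
])
ends = [end for _, end in periods]

# LAG(EndDate): end date of the immediately preceding row (None for the first row).
lag_end = [None] + ends[:-1]

# MAX(EndDate) over all preceding rows (None for the first row).
running_max_end = [None] + list(accumulate(ends, max))[:-1]

# Rows where the two "previous end date" values disagree.
differs = [i for i in range(len(periods)) if lag_end[i] != running_max_end[i]]

for i in differs:
    start, end = periods[i]
    print(start, end, "LAG:", lag_end[i], "running max:", running_max_end[i])
```

For the row 8/28/2017 - 9/19/2017, LAG only sees 8/26/2017 (the short period just before it), so the row looks like the start of a new island. The running maximum is 9/23/2017, which correctly shows the row is still covered by an earlier period.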

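So the innermost subquery calculates the previous (maximum) end date. The middle subquery takes a cumulative sum over the group-start flags to assign a group identifier.

The outer query simply aggregates by that group identifier.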
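The same three steps can be sketched procedurally. The following is a minimal Python illustration of the logic, not a replacement for the SQL; `find_islands` is a hypothetical helper name, and it assumes (like the `>=` in the query) that periods which merely touch belong to the same island.

```python
from datetime import date


def find_islands(periods):
    """Merge overlapping or touching (start, end) periods into islands."""
    islands = []
    # Same ordering as the window functions: by start date.
    for start, end in sorted(periods):
        # The end of the last island equals the maximum end date of all
        # earlier rows, i.e. the prev_enddate column of the query.
        if islands and islands[-1][1] >= start:
            # Overlap or touch: extend the current island if needed.
            islands[-1][1] = max(islands[-1][1], end)
        else:
            # Gap: this row starts a new island (the flag that is summed).
            islands.append([start, end])
    return [tuple(island) for island in islands]


sample = [
    (date(2017, 8, 20), date(2017, 8, 21)),
    (date(2017, 8, 22), date(2017, 9, 22)),
    (date(2017, 8, 23), date(2017, 9, 23)),
    (date(2017, 8, 24), date(2017, 8, 26)),
    (date(2017, 8, 28), date(2017, 9, 19)),
    (date(2017, 9, 23), date(2017, 9, 27)),
    (date(2017, 9, 25), date(2017, 10, 10)),
    (date(2017, 10, 17), date(2017, 10, 18)),
    (date(2017, 10, 25), date(2017, 11, 3)),
    (date(2017, 11, 3), date(2017, 11, 15)),
]

for island_start, island_end in find_islands(sample):
    print(island_start, island_end)
```

On the sample data this yields four islands: 8/20 - 8/21, 8/22 - 10/10, 10/17 - 10/18 and 10/25 - 11/15. If touching periods should stay separate, change `>=` to `>` in both the sketch and the query.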