Fill gaps in SQL Server date ranges

In SQL Server 2014, I have a Periods table like this:

| PeriodId | PeriodStart | PeriodEnd  |
---------------------------------------
| 202005   | 2020-05-01  | 2020-05-31 |
| 202006   | 2020-06-01  | 2020-06-30 |

Periods do not always run from the first to the last day of a month.

Then I have an Activities table, which holds the activities scheduled for each user:

| ActivityId | UserId | ActivityStart | ActivityEnd |
-----------------------------------------------------
| 1          | A      | 2020-05-20    | 2020-06-05  |
| 2          | A      | 2020-06-15    | 2020-06-18  |
| 3          | B      | 2020-06-10    | 2020-06-25  |

A user can have gaps between activities, but the same user never has overlapping activities.

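Now I need a query that clips the activity date ranges to the start and end of the period and fills the gaps so that the whole period is covered. I will always filter by PeriodId, so I'll only show the example result for PeriodId = 202006: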
| PeriodId | UserId | ActivityId | NewActivityStart | NewActivityEnd |
----------------------------------------------------------------------
| 202006   | A      | 1          | 2020-06-01       | 2020-06-05     |  --Part of ActivityId 1
| 202006   | A      | NULL       | 2020-06-06       | 2020-06-14     |  --Fill between activities 1 and 2
| 202006   | A      | 2          | 2020-06-15       | 2020-06-18     |
| 202006   | A      | NULL       | 2020-06-19       | 2020-06-30     |  --Fill until end of period
| 202006   | B      | NULL       | 2020-06-01       | 2020-06-09     |  --Fill from start of period
| 202006   | B      | 3          | 2020-06-10       | 2020-06-25     |
| 202006   | B      | NULL       | 2020-06-26       | 2020-06-30     |  --Fill until end of period

I have been able to clip the activity dates to the period with the following query:

SELECT p.PeriodId, a.UserId, a.ActivityId,
       IIF(p.PeriodStart > a.ActivityStart, p.PeriodStart, a.ActivityStart) AS NewActivityStart,
       IIF(p.PeriodEnd < a.ActivityEnd, p.PeriodEnd, a.ActivityEnd) AS NewActivityEnd
FROM Periods p
JOIN Activities a ON a.ActivityStart <= p.PeriodEnd AND a.ActivityEnd >= p.PeriodStart

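But I can't fill the gaps between the ranges. I have tried using a related dates table and/or window functions such as LAG/LEAD.

I feel window functions might be the solution, and I tried following examples about gaps/islands, but I haven't been able to understand them well enough to make this work.

Is there a way to complete the query so that it fills the missing gaps? Is there another way to achieve this in a query?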

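You can solve this using various techniques. In the example below I use one such approach, written as if the code were the body of a SQL routine.

So, here is your data: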

DECLARE @Periods TABLE
(
    [PeriodId] INT
   ,[PeriodStart] DATE
   ,[PeriodEnd] DATE
);

INSERT INTO @Periods ([PeriodId], [PeriodStart], [PeriodEnd])
VALUES (202005, '2020-05-01', '2020-05-31')
      ,(202006, '2020-06-01', '2020-06-30');

DECLARE @Activities  TABLE
(
    [ActivityId] INT
   ,[UserId] CHAR(1)
   ,[ActivityStart] DATE
   ,[ActivityEnd] DATE
);

INSERT INTO @Activities ([ActivityId], [UserId], [ActivityStart], [ActivityEnd])
VALUES (1, 'A', '2020-05-20', '2020-06-05')
      ,(2, 'A', '2020-06-15', '2020-06-18')
      ,(3, 'B', '2020-06-10', '2020-06-25');

Then, suppose we have an input parameter @PeriodID, which we use to look up the start and end dates of the corresponding period:

DECLARE @PeriodID INT
       ,@PeriodDateStart DATE
       ,@PeriodDateEnd DATE;

SET @PeriodID = 202006;

SELECT @PeriodDateStart = [PeriodStart]
      ,@PeriodDateEnd = [PeriodEnd]
FROM @Periods 
WHERE [PeriodId] = @PeriodID;

Then, let's create a buffer table in which we compute the matches between the activity and period tables, and add start and end records where needed:

DECLARE @Buffer TABLE
(
    [ActivityId] INT
   ,[UserId] CHAR(1)
   ,[ActivityStart] DATE
   ,[ActivityEnd] DATE
);

WITH DataSource AS
(
    SELECT A.[ActivityId]
          ,A.[UserId]
          ,A.[ActivityStart]
          ,A.[ActivityEnd]
    FROM @Activities A
    INNER JOIN @Periods P
        ON A.[ActivityStart] <= P.[PeriodEnd]
        AND A.[ActivityEnd] >= P.[PeriodStart]
    WHERE P.PeriodId = @PeriodID
)
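INSERT INTO @Buffer ([ActivityId], [UserId], [ActivityStart], [ActivityEnd])
-- Activities overlapping the period, clipped to the period boundaries
SELECT [ActivityId]
      ,[UserId]
      ,IIF([ActivityStart] < @PeriodDateStart, @PeriodDateStart, [ActivityStart]) AS [ActivityStart]
      ,IIF([ActivityEnd] > @PeriodDateEnd, @PeriodDateEnd, [ActivityEnd]) AS [ActivityEnd]
FROM DataSource
UNION ALL
-- Filler row from the day after the user's last activity to the period end
SELECT NULL
      ,[UserId]
      ,DATEADD(DAY, 1, MAX([ActivityEnd]))
      ,@PeriodDateEnd
FROM DataSource
GROUP BY [UserId]
HAVING DATEADD(DAY, 1, MAX([ActivityEnd])) <= @PeriodDateEnd  -- <= keeps a one-day filler
UNION ALL
-- Filler row from the period start to the day before the user's first activity
SELECT NULL
      ,[UserId]
      ,@PeriodDateStart
      ,DATEADD(DAY, -1, MIN([ActivityStart]))
FROM DataSource
GROUP BY [UserId]
HAVING DATEADD(DAY, -1, MIN([ActivityStart])) >= @PeriodDateStart;  -- >= keeps a one-day filler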

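It's straightforward. In the common table expression I used your code. Then we simply check whether a record needs to be added after and/or before a given user's activities within the period.

Now we can calculate the gaps, right? There are many variations here. I am using the LEAD function to calculate the missing period for each row. The statement is: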

SELECT *
      ,DATEADD(DAY, 1, [ActivityEnd]) AS [MissingPeriodStart]
      ,DATEADD(DAY, -1, LEAD([ActivityStart]) OVER (PARTITION BY [UserID] ORDER BY [ActivityStart] ASC)) AS [MissingPeriodEnd]
FROM @Buffer
ORDER BY USERID, ActivityStart;

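In the output you can see that missing-period dates are generated for every row except each user's last one, where LEAD returns NULL. Now we only need to keep the candidates that are real gaps. Like this: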

WITH DataSource AS
(
    SELECT *
          ,DATEADD(DAY, 1, [ActivityEnd]) AS [MissingPeriodStart]
          ,DATEADD(DAY, -1, LEAD([ActivityStart]) OVER (PARTITION BY [UserID] ORDER BY [ActivityStart] ASC)) AS [MissingPeriodEnd]
    FROM @Buffer
)
SELECT @PeriodID AS [PeriodID]
      ,[UserId]
      ,[ActivityId]
      ,[ActivityStart]
      ,[ActivityEnd]
FROM DataSource
UNION ALL 
SELECT @PeriodID AS [PeriodID]
      ,MP.[UserId]
      ,NULL
      ,MP.[MissingPeriodStart]
      ,MP.[MissingPeriodEnd]
FROM DataSource MP
WHERE NOT EXISTS 
(
    -- Skip a candidate gap if another row for the same user starts on that day
    SELECT 1 
    FROM DataSource DS
    WHERE DS.[ActivityStart] = MP.[MissingPeriodStart]
        AND DS.[UserId] = MP.[UserId]
)
    AND MP.[MissingPeriodStart] <= MP.[MissingPeriodEnd]  -- <= keeps one-day gaps
ORDER BY [UserId]
        ,[ActivityStart];

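For the sample data, the result is the seven rows shown as the expected output in the question.

Of course, this is just one idea. You may need to change or adapt it to work with your real data. I hope it gives you a starting point.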
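To look at the LEAD step in isolation, here is a minimal sketch that runs the same gap calculation on two inline rows. The rows are user A's clipped June activities from the question; the derived table alias v and the column names GapStart and GapEnd are only illustrative.

```sql
-- Inline sample rows (an assumption for illustration): user A's activities,
-- already clipped to the 2020-06 period.
SELECT v.UserId
      ,v.ActivityStart
      ,v.ActivityEnd
      -- A candidate gap starts the day after this activity ends...
      ,DATEADD(DAY, 1, v.ActivityEnd) AS GapStart
      -- ...and ends the day before the same user's next activity starts.
      ,DATEADD(DAY, -1, LEAD(v.ActivityStart) OVER (PARTITION BY v.UserId ORDER BY v.ActivityStart)) AS GapEnd
FROM (VALUES ('A', CAST('2020-06-01' AS DATE), CAST('2020-06-05' AS DATE))
            ,('A', CAST('2020-06-15' AS DATE), CAST('2020-06-18' AS DATE))
     ) AS v (UserId, ActivityStart, ActivityEnd)
ORDER BY v.UserId, v.ActivityStart;
```

The first row yields the gap 2020-06-06 to 2020-06-14. The second row has no following activity, so LEAD returns NULL and GapEnd is NULL, which is why the period-end filler has to come from a separate step.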

This isn't the craziest gaps problem I've seen, but it's a good one.

DECLARE @PeriodId int = 202006;

DECLARE @ps date, @pe date;
SELECT @ps = PeriodStart, @pe = PeriodEnd FROM dbo.Periods
   WHERE PeriodId = @PeriodId;
   
;WITH dates(rn,dt) AS 
(
    SELECT 1, @ps UNION ALL SELECT rn + 1, DATEADD(DAY, rn, @ps) 
    FROM dates WHERE dt < @pe
),
groups(UserId, dt, ActivityId, grp) AS
(
  SELECT u.UserId, d.dt, r.ActivityId, 
    d.rn - DENSE_RANK() OVER (PARTITION BY u.UserId, r.ActivityStart ORDER BY d.dt)
  FROM dates AS d CROSS JOIN (SELECT DISTINCT UserId FROM dbo.Activities 
    WHERE @pe >= ActivityStart AND @ps <= ActivityEnd) AS u
  LEFT OUTER JOIN dbo.Activities AS r
  ON u.UserId = r.UserId AND d.dt >= r.ActivityStart AND d.dt <= r.ActivityEnd
)
SELECT PeriodId = @PeriodId, UserId, ActivityId,
  NewActivityStart = MIN(dt),
  NewActivityEnd   = MAX(dt)
FROM groups 
GROUP BY UserId, ActivityId, grp
ORDER BY UserId, NewActivityStart;

If a period can be longer than 100 days, you need to add MAXRECURSION at the end:

OPTION (MAXRECURSION 32767);  

If a period can be longer than 32,767 days, change 32767 to 0.
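As a rough illustration of why the hint matters, the sketch below expands a hypothetical 182-day period (the dates are made up for this example) with a recursive CTE. That takes 181 recursive steps, which is more than the default limit of 100, so the query only completes with a MAXRECURSION hint.

```sql
-- Hypothetical period longer than 100 days (2020-01-01 through 2020-06-30).
DECLARE @ps date = '2020-01-01', @pe date = '2020-06-30';

WITH dates(dt) AS
(
    SELECT @ps                          -- anchor: first day of the period
    UNION ALL
    SELECT DATEADD(DAY, 1, dt)          -- recursive step: one more day
    FROM dates
    WHERE dt < @pe
)
SELECT COUNT(*) AS DayCount             -- one row per day in the period
FROM dates
OPTION (MAXRECURSION 366);              -- raise the default limit of 100
```

Removing the OPTION clause makes the statement stop with a maximum-recursion error once the 100th recursive step is exhausted.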

Updated fiddle here

I don't think this is that complicated. If you expand the periods into individual dates and do a left join, this becomes a gaps-and-islands problem:

with dates as (
      select periodid, periodstart as dte, periodend
      from periods
      union all
      select periodid, dateadd(day, 1, dte), periodend
      from dates
      where dte < periodend
     )
select userid, activityid, min(dte), max(dte)
from (select d.dte, d.periodid, u.userid, a.activityid,
             row_number() over (partition by u.userid, a.activityid order by d.dte) as seqnum
      from dates d cross join
           (select distinct userid from activities) u left join
           activities a
           on a.userid = u.userid and
              a.activitystart <= d.dte and a.activityend >= d.dte
     ) da
group by userid, activityid, periodid, dateadd(day, -seqnum, dte)
order by userid, min(dte);

Here is a db<>fiddle.
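The grouping trick in that query can be seen on its own in the small sketch below: subtracting a running row number from each date gives the same value for every date in a consecutive run, so that value can serve as a group key. The five dates are made-up sample values, and the cast of the row number to int is a defensive assumption added here, not part of the query above.

```sql
select dte,
       row_number() over (order by dte) as seqnum,
       -- date minus its sequence number is constant within a consecutive run
       dateadd(day, -cast(row_number() over (order by dte) as int), dte) as grp
from (values (cast('2020-06-01' as date)),
             (cast('2020-06-02' as date)),
             (cast('2020-06-03' as date)),
             (cast('2020-06-10' as date)),
             (cast('2020-06-11' as date))
     ) v(dte)
order by dte;
```

The first three dates all map to grp = 2020-05-31 and the last two to grp = 2020-06-06, so grouping by grp and taking min(dte) and max(dte) returns one row per run.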

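Note: this produces results for all users and all periods, which seems reasonable based on your description. Filtering out users with no activity in a given period is straightforward.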
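One possible way to do that filtering, sketched here under the assumption that the question's table layout applies, is to build the list of (period, user) pairs from the overlap join instead of taking every distinct user. The two CTEs below only stand in for the real periods and activities tables so that the sketch runs on its own.

```sql
with periods as (      -- stand-in for the real periods table
      select *
      from (values (202005, cast('2020-05-01' as date), cast('2020-05-31' as date)),
                   (202006, cast('2020-06-01' as date), cast('2020-06-30' as date))
           ) v(periodid, periodstart, periodend)
     ),
     activities as (   -- stand-in for the real activities table
      select *
      from (values (1, 'A', cast('2020-05-20' as date), cast('2020-06-05' as date)),
                   (2, 'A', cast('2020-06-15' as date), cast('2020-06-18' as date)),
                   (3, 'B', cast('2020-06-10' as date), cast('2020-06-25' as date))
           ) v(activityid, userid, activitystart, activityend)
     )
-- Only users with at least one activity overlapping the period
select distinct p.periodid, a.userid
from periods p join
     activities a
     on a.activitystart <= p.periodend and a.activityend >= p.periodstart
order by p.periodid, a.userid;
```

With the sample data this returns user A for 202005 and users A and B for 202006. Joining the expanded dates to this list on periodid, rather than cross joining to all distinct users, would keep user B out of the 202005 results.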

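Also, this does not go to the end of the month. Instead, it covers complete periods. I don't see why months would affect this, other than to confuse the issue; for example, consider two periods that both have days in the same month.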