SQL Server group / partition to condense history table

Question

I have a table of the dates during which a person belonged to a particular category, like this:

    set dateformat dmy  -- the sample literals below are dd/mm/yyyy
    drop table if exists #category
    create table #category (personid int, categoryid int, startdate datetime, enddate datetime)
    insert into #category 
    select * from 
    (
    select 1 Personid, 1 CategoryID, '01/04/2010' StartDate, '31/07/2016' EndDate union
    select 1 Personid, 5 CategoryID, '07/08/2016' StartDate, '31/03/2019' EndDate union
    select 1 Personid, 5 CategoryID, '01/04/2019' StartDate, '01/04/2019' EndDate union
    select 1 Personid, 5 CategoryID, '02/04/2019' StartDate, '11/08/2019' EndDate union
    select 1 Personid, 4 CategoryID, '12/08/2019' StartDate, '03/11/2019' EndDate union
    select 1 Personid, 5 CategoryID, '04/11/2019' StartDate, '22/03/2020' EndDate union
    select 1 Personid, 5 CategoryID, '23/03/2020' StartDate, NULL EndDate union
    select 2 Personid, 1 CategoryID, '01/04/2010' StartDate, '09/04/2015' EndDate union
    select 2 Personid, 4 CategoryID, '10/04/2015' StartDate, '31/03/2018' EndDate union
    select 2 Personid, 4 CategoryID, '01/04/2018' StartDate, '31/03/2019' EndDate union
    select 2 Personid, 4 CategoryID, '01/04/2019' StartDate, '23/06/2019' EndDate union
    select 2 Personid, 4 CategoryID, '24/06/2019' StartDate, NULL EndDate 
    ) x
    order by personid, startdate

I'm trying to condense it so that I get this:

PersonID	categoryid	startdate	EndDate
1	1	01/04/2010	31/07/2016
1	5	07/08/2016	11/08/2019
1	4	12/08/2019	03/11/2019
1	5	04/11/2019	NULL
2	1	01/04/2010	09/04/2015
2	4	10/04/2015	NULL

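I'm running into a problem with people like personid 1, who are in (for example) category 5, then move to category 4, and then go back to category 5.

So doing something like this: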

    select
        personid,
        categoryid,
        min(startdate) startdate,
        max(enddate) enddate
    from #category
    group by
        personid, categoryid

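gives me the earliest date of the first category 5 period and the latest date of the second one, which means it creates an overlapping period.

So I tried partitioning it with row_number or rank, but it still does the same thing, i.e. it treats "category 5" as a single group: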

    select
        rank() over (partition by personid, categoryid order by personid, startdate) rank,
        c.*
    from #category c
    order by personid, startdate

rank	personid	categoryid	startdate	enddate
1	1	1	2010-04-01 00:00:00.000	2016-07-31 00:00:00.000
1	1	5	2016-08-07 00:00:00.000	2019-03-31 00:00:00.000
2	1	5	2019-04-01 00:00:00.000	2019-04-01 00:00:00.000
3	1	5	2019-04-02 00:00:00.000	2019-08-11 00:00:00.000
1	1	4	2019-08-12 00:00:00.000	2019-11-03 00:00:00.000
4	1	5	2019-11-04 00:00:00.000	2020-03-22 00:00:00.000
5	1	5	2020-03-23 00:00:00.000	NULL
1	2	1	2010-04-01 00:00:00.000	2015-04-09 00:00:00.000
1	2	4	2015-04-10 00:00:00.000	2018-03-31 00:00:00.000
2	2	4	2018-04-01 00:00:00.000	2019-03-31 00:00:00.000
3	2	4	2019-04-01 00:00:00.000	2019-06-23 00:00:00.000
4	2	4	2019-06-24 00:00:00.000	NULL

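You can see in the rank column that category 5 goes 1, 2, 3, skips a row, and then continues with 4, 5, so those rows are obviously all in the same partition. I thought adding the order by clause would force a new partition to start when the category changed from 5 to 4 and back again.

Any ideas?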

Answer 1

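This is a type of gaps-and-islands problem. However, if your data tiles perfectly (no gaps), as it appears to in the sample data, you can do this without any aggregation at all, which should be the most efficient approach: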

    select personid, categoryid, startdate,
           dateadd(day, -1, lead(startdate) over (partition by personid order by startdate)) as enddate
    from (select c.*,
                 lag(categoryid) over (partition by personid order by startdate) as prev_categoryid
          from #category c
         ) c
    where prev_categoryid is null or prev_categoryid <> categoryid;

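The where clause keeps only the rows where the category changes. lead() then gets the next start date among those rows, and subtracting 1 day from it gives the enddate you want. For a person's last period there is no next row, so lead() returns NULL and the enddate stays NULL.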
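If the periods might not tile perfectly, a more general gaps-and-islands form with aggregation could look something like the sketch below. It is only an illustration: the names `flagged`, `islands`, `is_new_island` and `island_id` are made up here, and it assumes a NULL enddate means "still open". Each row where the category changes gets a flag, a running sum of the flags numbers the consecutive runs, and each run is then aggregated.

```sql
-- Sketch only: CTE and column names (flagged, islands, is_new_island, island_id)
-- are hypothetical and not part of the original table.
with flagged as (
    select c.*,
           -- 1 when the category differs from the previous row for this person
           -- (the first row has no previous row, so the comparison is unknown -> 1)
           case when lag(categoryid) over (partition by personid order by startdate) = categoryid
                then 0 else 1 end as is_new_island
    from #category c
),
islands as (
    select f.*,
           -- the running total of the flags gives every consecutive run its own number
           sum(is_new_island) over (partition by personid order by startdate
                                    rows unbounded preceding) as island_id
    from flagged f
)
select personid,
       categoryid,
       min(startdate) as startdate,
       -- max() ignores NULLs, so an open-ended run is detected by comparing the counts
       case when count(*) = count(enddate) then max(enddate) end as enddate
from islands
group by personid, categoryid, island_id
order by personid, startdate;
```

Because this version keeps the stored end dates rather than deriving them from the next start date, the break between 31/07/2016 and 07/08/2016 for personid 1 stays visible. The lead() query above would instead report 06/08/2016 as the end of that first period.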
