SQL - gap and island issue for more attributes
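I have the following table which, among other attributes, contains:
- ID - unique identifier
- Column1
- Column2
- Column3
- CreatedDate - the time when the record was created (based on ETL)
- UpdatedDate - the time until which the record was valid
Because attributes other than these 3 columns are also tracked for historical values, the same ID can have several consecutive rows with identical values in all three columns but different timestamps in [CreatedDate] / [UpdatedDate]. The data can therefore look like this: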
ID | Column1 | Column2 | Column3 | CreatedDate | UpdatedDate |
---|---|---|---|---|---|
1122 | T1 | In Progress | NULL | 02/02/2022 18:39:38 | 29/03/2022 14:25:24 |
1122 | T1 | In Progress | NULL | 05/01/2022 10:45:50 | 02/02/2022 18:39:38 |
1122 | T1 | In Progress | NULL | 03/01/2022 12:11:47 | 05/01/2022 10:45:50 |
1122 | T1 | In Progress | Yes | 13/12/2021 21:43:44 | 03/01/2022 12:11:47 |
1122 | T1 | In Progress | NULL | 17/02/2021 14:12:15 | 13/12/2021 21:43:44 |
1122 | T1 | In Progress | NULL | 22/12/2020 14:38:32 | 17/02/2021 14:12:15 |
1122 | T1 | In Progress | NULL | 17/12/2020 18:38:38 | 22/12/2020 14:38:32 |
1122 | T3 | Ready | NULL | 30/03/2020 14:35:18 | 17/12/2020 18:38:38 |
1122 | NULL | Ready | NULL | 04/09/2019 18:33:24 | 30/03/2020 14:35:18 |
1122 | T2 | Ready | NULL | 07/01/2019 11:07:39 | 04/09/2019 18:33:24 |
1122 | T2 | Ready | NULL | 17/09/2018 14:31:17 | 07/01/2019 11:07:39 |
1122 | T0 | Ready | NULL | 28/08/2018 14:31:39 | 17/09/2018 14:31:17 |
1122 | T0 | Ready | NULL | 13/02/2018 14:48:44 | 28/08/2018 14:31:39 |
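I would like to keep only the distinct consecutive combinations of all 3 columns, in the correct chronological order, so the ideal output should look like this: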
ID | Column1 | Column2 | Column3 | CreatedDate | UpdatedDate |
---|---|---|---|---|---|
1122 | T1 | In Progress | NULL | 03/01/2022 12:11:47 | 29/03/2022 14:25:24 |
1122 | T1 | In Progress | Yes | 13/12/2021 21:43:44 | 03/01/2022 12:11:47 |
1122 | T1 | In Progress | NULL | 17/12/2020 18:38:38 | 13/12/2021 21:43:44 |
1122 | T3 | Ready | NULL | 30/03/2020 14:35:18 | 17/12/2020 18:38:38 |
1122 | NULL | Ready | NULL | 04/09/2019 18:33:24 | 30/03/2020 14:35:18 |
1122 | T2 | Ready | NULL | 17/09/2018 14:31:17 | 04/09/2019 18:33:24 |
1122 | T0 | Ready | NULL | 13/02/2018 14:48:44 | 17/09/2018 14:31:17 |
The code below works fine if there is only one column, but it does not work for multiple columns, because it returns all the unique rows.
select ID, Column1, Column2, Column3, min(createddate), max(updateddate)
from (select h.*,
             sum(case when prev_updatedate >= createddate then 0 else 1 end) over (partition by ID order by createddate) as grp
      from (select h.*,
                   max(updateddate) over (partition by ID order by createddate rows between unbounded preceding and 1 preceding) as prev_updatedate
            from #history h
           ) h
     ) h
group by ID, Column1, Column2, Column3, grp;
Is there a solution for this?
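You can try using the ROW_NUMBER window function twice: once numbering all rows of an ID, and once numbering the rows within each combination of ID, Column1, Column2 and Column3. The difference between the two numbers stays constant while the three columns do not change, so it can serve as the gaps-and-islands grouping key.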
SELECT ID, Column1, Column2, Column3,
       MIN(createddate) AS CreatedDate,
       MAX(updateddate) AS UpdatedDate
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY createddate) -
           ROW_NUMBER() OVER (PARTITION BY ID, Column1, Column2, Column3 ORDER BY createddate) AS grp
    FROM history
) t1
GROUP BY grp, ID, Column1, Column2, Column3
ORDER BY CreatedDate DESC
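To check the ROW_NUMBER difference against the sample data from the question, here is a minimal sketch that runs the same query in an in-memory SQLite database from Python. Several things are assumptions of the sketch rather than part of the original setup: the use of SQLite instead of the asker's database (window functions need SQLite 3.25 or newer), the table definition, the helper name `collapse_islands`, and the timestamps being rewritten in ISO format so that they sort chronologically as text.

```python
import sqlite3

# Sample rows from the question. Timestamps are rewritten as ISO strings
# (an assumption of this sketch) so that text ordering is chronological.
rows = [
    (1122, "T0", "Ready", None, "2018-02-13 14:48:44", "2018-08-28 14:31:39"),
    (1122, "T0", "Ready", None, "2018-08-28 14:31:39", "2018-09-17 14:31:17"),
    (1122, "T2", "Ready", None, "2018-09-17 14:31:17", "2019-01-07 11:07:39"),
    (1122, "T2", "Ready", None, "2019-01-07 11:07:39", "2019-09-04 18:33:24"),
    (1122, None, "Ready", None, "2019-09-04 18:33:24", "2020-03-30 14:35:18"),
    (1122, "T3", "Ready", None, "2020-03-30 14:35:18", "2020-12-17 18:38:38"),
    (1122, "T1", "In Progress", None, "2020-12-17 18:38:38", "2020-12-22 14:38:32"),
    (1122, "T1", "In Progress", None, "2020-12-22 14:38:32", "2021-02-17 14:12:15"),
    (1122, "T1", "In Progress", None, "2021-02-17 14:12:15", "2021-12-13 21:43:44"),
    (1122, "T1", "In Progress", "Yes", "2021-12-13 21:43:44", "2022-01-03 12:11:47"),
    (1122, "T1", "In Progress", None, "2022-01-03 12:11:47", "2022-01-05 10:45:50"),
    (1122, "T1", "In Progress", None, "2022-01-05 10:45:50", "2022-02-02 18:39:38"),
    (1122, "T1", "In Progress", None, "2022-02-02 18:39:38", "2022-03-29 14:25:24"),
]

# The query from the answer: the difference of the two row numbers is
# constant inside each run of identical (Column1, Column2, Column3) values.
ISLAND_QUERY = """
SELECT ID, Column1, Column2, Column3,
       MIN(CreatedDate) AS CreatedDate,
       MAX(UpdatedDate) AS UpdatedDate
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CreatedDate) -
           ROW_NUMBER() OVER (PARTITION BY ID, Column1, Column2, Column3
                              ORDER BY CreatedDate) AS grp
    FROM history
) t1
GROUP BY grp, ID, Column1, Column2, Column3
ORDER BY CreatedDate DESC
"""


def collapse_islands(history_rows):
    """Load the rows into a temporary table and return one row per island."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE history (ID INTEGER, Column1 TEXT, Column2 TEXT, "
            "Column3 TEXT, CreatedDate TEXT, UpdatedDate TEXT)"
        )
        conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?, ?, ?)", history_rows)
        return conn.execute(ISLAND_QUERY).fetchall()
    finally:
        conn.close()


islands = collapse_islands(rows)
for island in islands:
    print(island)
```

On this data the query returns seven rows, and the three separate T1 / In Progress periods (NULL, then Yes, then NULL again) stay separate instead of being merged into one. Note that the sketch relies on NULLs being placed in the same partition and the same group, which is how SQLite treats them in PARTITION BY and GROUP BY.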