How to do row numbering with ordering, partitioning and grouping
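I need to number rows with ordering, partitioning and grouping: ordering by IdDocument, DateChange; partitioning by IdDocument; grouping by IdRole. The problem is the grouping in particular. As the example shows (the NumberingExpected column), DENSE_RANK() looks like the best function for this purpose, but it repeats a number only when the values used for ordering are equal. In my case the values used for ordering (IdDocument, DateChange) are always different, and the repetition of the numbering must be driven by IdRole.
Of course, this could easily be solved with a cursor. But is there a way to do it with the numbering/ranking functions?
Test data: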
declare @LogTest as table (
Id INT
,IdRole INT
,DateChange DATETIME
,IdDocument INT
,NumberingExpected INT
)
insert into @LogTest
select 1 as Id, 7 as IdRole, GETDATE() as DateChange, 13 as IdDocument, 1 as NumberingExpected
union
select 2, 3, DATEADD(HH, 1, GETDATE()), 13, 2
union
select 3, 3, DATEADD(HH, 2, GETDATE()), 13, 2
union
select 4, 3, DATEADD(HH, 3, GETDATE()), 13, 2
union
select 5, 5, DATEADD(HH, 4, GETDATE()), 13, 3
union
select 7, 3, DATEADD(HH, 6, GETDATE()), 13, 4
union
select 6, 3, DATEADD(HH, 5, GETDATE()), 27, 1
union
select 8, 3, DATEADD(HH, 7, GETDATE()), 27, 1
union
select 9, 5, DATEADD(HH, 8, GETDATE()), 27, 2
union
select 10, 3, DATEADD(HH, 9, GETDATE()), 27, 3
select * from @LogTest order by IdDocument, DateChange;
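Explanation as procedural pseudocode:
1) Order the data by IdDocument, DateChange.
2) Set the number of the first row to i = 1 and go to the next row.
3) If IdDocument has changed { i = 1; } else { if IdRole has changed { i++; } }
4) Set the row number to i.
5) Go to the next row.
6) If EOF { exit; } else { go to step 3; }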
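The steps above can be traced with a small Python sketch. This is only an illustration of the intended numbering, not a SQL solution; the function name `number_rows` and the fixed start time (standing in for GETDATE()) are my own assumptions, while the rows are the test data from the question.

```python
from datetime import datetime, timedelta

def number_rows(rows):
    """Walk the rows ordered by (IdDocument, DateChange) and return {Id: number}."""
    numbering = {}
    prev_doc = prev_role = None
    i = 0
    for row in sorted(rows, key=lambda r: (r["IdDocument"], r["DateChange"])):
        if row["IdDocument"] != prev_doc:
            i = 1      # new document: restart the numbering
        elif row["IdRole"] != prev_role:
            i += 1     # same document, role changed: move to the next number
        numbering[row["Id"]] = i
        prev_doc, prev_role = row["IdDocument"], row["IdRole"]
    return numbering

# Same rows as the test data; a fixed start time stands in for GETDATE().
start = datetime(2015, 1, 26, 20, 0, 0)
# (Id, IdRole, hours after start, IdDocument, NumberingExpected)
raw = [
    (1, 7, 0, 13, 1), (2, 3, 1, 13, 2), (3, 3, 2, 13, 2), (4, 3, 3, 13, 2),
    (5, 5, 4, 13, 3), (7, 3, 6, 13, 4), (6, 3, 5, 27, 1), (8, 3, 7, 27, 1),
    (9, 5, 8, 27, 2), (10, 3, 9, 27, 3),
]
rows = [
    {"Id": id_, "IdRole": role, "DateChange": start + timedelta(hours=hours),
     "IdDocument": doc, "NumberingExpected": expected}
    for id_, role, hours, doc, expected in raw
]

numbering = number_rows(rows)
for row in sorted(rows, key=lambda r: (r["IdDocument"], r["DateChange"])):
    print(row["Id"], row["IdRole"], row["IdDocument"], numbering[row["Id"]])
```

Note that Id 7 gets number 4 rather than 2: role 3 appears again in document 13 after role 5, and a role that comes back starts a new group.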
WITH RankByIdDocumentAndDataChanged AS
(
SELECT *,
CASE
IdRole - LAG(IdRole) OVER (PARTITION BY IdDocument ORDER BY DateChange)
WHEN 0 THEN 0
ELSE 1
END AS DIFF
FROM @LogTest
)
-- Running total of the change flags within each document gives the group number
select *, SUM(DIFF) OVER (PARTITION BY IdDocument ORDER BY DateChange) AS Numbering
from RankByIdDocumentAndDataChanged
ORDER BY Id
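As I read the query above, it works in two passes: DIFF is 0 when a row has the same IdRole as the previous row of the same document and 1 otherwise (including the first row, where LAG returns NULL and the CASE falls through to ELSE), and the running SUM of DIFF per document is then the group number. Here is a rough Python sketch of that same idea, just to make the two passes visible. The name `running_change_count` is hypothetical, and integer hour offsets stand in for the DateChange values.

```python
from itertools import accumulate, groupby
from operator import itemgetter

def running_change_count(rows):
    """Flag role changes per document, then take a running sum of the flags."""
    ordered = sorted(rows, key=itemgetter("IdDocument", "DateChange"))
    result = {}
    for _, group in groupby(ordered, key=itemgetter("IdDocument")):
        group = list(group)
        roles = [r["IdRole"] for r in group]
        # LAG equivalent: the previous role, None for the first row of a document
        prev_roles = [None] + roles[:-1]
        # CASE ... WHEN 0 THEN 0 ELSE 1: flag is 1 unless the role is unchanged
        diff = [0 if cur == prev else 1 for cur, prev in zip(roles, prev_roles)]
        # SUM(DIFF) OVER (... ORDER BY DateChange): running total of the flags
        for row, total in zip(group, accumulate(diff)):
            result[row["Id"]] = total
    return result

# (Id, IdRole, DateChange as hour offset, IdDocument) from the test data
raw = [
    (1, 7, 0, 13), (2, 3, 1, 13), (3, 3, 2, 13), (4, 3, 3, 13), (5, 5, 4, 13),
    (7, 3, 6, 13), (6, 3, 5, 27), (8, 3, 7, 27), (9, 5, 8, 27), (10, 3, 9, 27),
]
rows = [dict(zip(("Id", "IdRole", "DateChange", "IdDocument"), r)) for r in raw]

numbers = running_change_count(rows)
print(numbers)
```

The running sum only matches the expected numbering because DateChange is unique within a document; with ties in the ordering column, the SQL window sum would treat tied rows as peers.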
This might not be pretty, but it does produce the desired output.
; with cte as (
select l.Id,l.IdRole,l.IdDocument,l.NumberingExpected,l.DateChange,
(select min(x.DateChange) from @LogTest x where x.IdDocument = l.IdDocument and x.IdRole = l.IdRole and x.id<=l.id and
x.id > (select max(y.id) from @LogTest y where y.IdDocument = l.IdDocument and y.IdRole <> l.IdRole and y.id <=l.Id)) as DateChange2
from @LogTest l
)
select c.Id,c.IdRole,c.DateChange,c.IdDocument,c.NumberingExpected,dense_rank() over (partition by c.IdDocument order by c.DateChange2) as rn
from cte c order by c.IdDocument, c.DateChange;
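If I had more time, I think the x.id predicate in the CTE could be improved.
Starting with SQL Server 2012 you can use LAG/LEAD, but they are not available in 2008, so we'll emulate them. Performance may be poor; you should check with your real data.
Here is the final query: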
WITH
CTE_rn
AS
(
SELECT
Main.IdRole
,Main.IdDocument
,Main.DateChange
,ROW_NUMBER() OVER(PARTITION BY Main.IdDocument ORDER BY Main.DateChange) AS rn
FROM
@LogTest AS Main
OUTER APPLY
(
SELECT TOP (1) T.IdRole
FROM @LogTest AS T
WHERE
T.IdDocument = Main.IdDocument
AND T.DateChange < Main.DateChange
ORDER BY T.DateChange DESC
) AS Prev
WHERE Main.IdRole <> Prev.IdRole OR Prev.IdRole IS NULL
)
SELECT *
FROM
@LogTest AS LT
CROSS APPLY
(
SELECT TOP(1) CTE_rn.rn
FROM CTE_rn
WHERE
CTE_rn.IdDocument = LT.IdDocument
AND CTE_rn.IdRole = LT.IdRole
AND CTE_rn.DateChange <= LT.DateChange
ORDER BY CTE_rn.DateChange DESC
) CA_rn
ORDER BY IdDocument, DateChange;
Final result set:
Id IdRole DateChange IdDocument NumberingExpected rn
1 7 2015-01-26 20:00:41.210 13 1 1
2 3 2015-01-26 21:00:41.210 13 2 2
3 3 2015-01-26 22:00:41.210 13 2 2
4 3 2015-01-26 23:00:41.210 13 2 2
5 5 2015-01-27 00:00:41.210 13 3 3
7 3 2015-01-27 02:00:41.210 13 4 4
6 3 2015-01-27 01:00:41.210 27 1 1
8 3 2015-01-27 03:00:41.210 27 1 1
9 5 2015-01-27 04:00:41.210 27 2 2
10 3 2015-01-27 05:00:41.210 27 3 3
How it works
1) With the table ordered by IdDocument and DateChange, we need the IdRole value of the previous row. To get it we use OUTER APPLY (because LAG is not available):
SELECT *
FROM
@LogTest AS Main
OUTER APPLY
(
SELECT TOP (1) T.IdRole
FROM @LogTest AS T
WHERE
T.IdDocument = Main.IdDocument
AND T.DateChange < Main.DateChange
ORDER BY T.DateChange DESC
) AS Prev
ORDER BY Main.IdDocument, Main.DateChange;
This is the result set of the first step (the last column is the previous row's IdRole):
Id IdRole DateChange IdDocument NumberingExpected IdRole
1 7 2015-01-26 20:50:32.560 13 1 NULL
2 3 2015-01-26 21:50:32.560 13 2 7
3 3 2015-01-26 22:50:32.560 13 2 3
4 3 2015-01-26 23:50:32.560 13 2 3
5 5 2015-01-27 00:50:32.560 13 3 3
7 3 2015-01-27 02:50:32.560 13 4 5
6 3 2015-01-27 01:50:32.560 27 1 NULL
8 3 2015-01-27 03:50:32.560 27 1 3
9 5 2015-01-27 04:50:32.560 27 2 3
10 3 2015-01-27 05:50:32.560 27 3 5
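2) We want to drop the rows where IdRole repeats, so we add a WHERE filter and number the remaining rows. You can see that the row numbers match the expected result: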
SELECT
Main.IdRole
,Main.IdDocument
,Main.DateChange
,ROW_NUMBER() OVER(PARTITION BY Main.IdDocument ORDER BY Main.DateChange) AS rn
FROM
@LogTest AS Main
OUTER APPLY
(
SELECT TOP (1) T.IdRole
FROM @LogTest AS T
WHERE
T.IdDocument = Main.IdDocument
AND T.DateChange < Main.DateChange
ORDER BY T.DateChange DESC
) AS Prev
WHERE Main.IdRole <> Prev.IdRole OR Prev.IdRole IS NULL
;
This is the result set of this step (it becomes the CTE):
IdRole IdDocument DateChange rn
7 13 2015-01-26 20:13:26.247 1
3 13 2015-01-26 21:13:26.247 2
5 13 2015-01-27 00:13:26.247 3
3 13 2015-01-27 02:13:26.247 4
3 27 2015-01-27 01:13:26.247 1
5 27 2015-01-27 04:13:26.247 2
3 27 2015-01-27 05:13:26.247 3
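3) Finally, we need to get the correct row number from the CTE for each row of the original table. I use CROSS APPLY to fetch one row from the CTE for each row of the original table: the latest CTE row with the same IdDocument and IdRole whose DateChange is not later than the current row's.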
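To check the logic of the three steps outside the database, here is a minimal Python sketch that mirrors them one to one. The function names `run_starts` and `number_by_lookup` are hypothetical, and integer hour offsets stand in for DateChange. Like the query, it rescans the table for every row, so it is quadratic and only meant for tracing the idea on small data.

```python
def run_starts(rows):
    """Steps 1-2: keep rows whose role differs from the previous row, numbered per document."""
    starts, counters = [], {}
    for row in sorted(rows, key=lambda r: (r["IdDocument"], r["DateChange"])):
        # OUTER APPLY (TOP 1 ... ORDER BY DateChange DESC): latest earlier row of the document
        earlier = [t for t in rows
                   if t["IdDocument"] == row["IdDocument"]
                   and t["DateChange"] < row["DateChange"]]
        prev_role = max(earlier, key=lambda t: t["DateChange"])["IdRole"] if earlier else None
        # WHERE Main.IdRole <> Prev.IdRole OR Prev.IdRole IS NULL
        if prev_role is None or prev_role != row["IdRole"]:
            counters[row["IdDocument"]] = counters.get(row["IdDocument"], 0) + 1
            starts.append({"IdDocument": row["IdDocument"], "IdRole": row["IdRole"],
                           "DateChange": row["DateChange"],
                           "rn": counters[row["IdDocument"]]})
    return starts

def number_by_lookup(rows):
    """Step 3: for each row, take rn from the latest matching run start."""
    starts = run_starts(rows)
    result = {}
    for row in rows:
        # CROSS APPLY (TOP 1 ... ORDER BY DateChange DESC) against the CTE
        candidates = [s for s in starts
                      if s["IdDocument"] == row["IdDocument"]
                      and s["IdRole"] == row["IdRole"]
                      and s["DateChange"] <= row["DateChange"]]
        result[row["Id"]] = max(candidates, key=lambda s: s["DateChange"])["rn"]
    return result

# (Id, IdRole, DateChange as hour offset, IdDocument) from the test data
raw = [
    (1, 7, 0, 13), (2, 3, 1, 13), (3, 3, 2, 13), (4, 3, 3, 13), (5, 5, 4, 13),
    (7, 3, 6, 13), (6, 3, 5, 27), (8, 3, 7, 27), (9, 5, 8, 27), (10, 3, 9, 27),
]
rows = [dict(zip(("Id", "IdRole", "DateChange", "IdDocument"), r)) for r in raw]

starts = run_starts(rows)
rn_by_id = number_by_lookup(rows)
print(rn_by_id)
```

The lookup always finds a candidate, because every row either starts a run itself or follows the start of its own run within the same document.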