Identify Consecutive Chunks in SQL Server Table
I have this table:
ValueId bigint // (identity) item ID
ListId bigint // group ID
ValueDelta int // item value
ValueCreated datetime2 // item created
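What I need is to find consecutive values within the same group, ordered by creation time rather than by ID. Created and ID are not guaranteed to be in the same order.
So the output should be: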
ListID bigint
FirstId bigint // from this ID (first in LID with Value ordered by Date)
LastId bigint // to this ID (last in LID with Value ordered by Date)
ValueDelta int // all share this value
ValueCount // and this many occurrences (number of items between FirstId and LastId)
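I could do this with cursors, but I'm sure that's not the best idea, so I was wondering whether it can be done in a query.
In your answer (if any), please explain it a bit.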
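Use a CTE that adds a Row_Number column, partitioned by GroupId and Value and ordered by Created.
Then select from the CTE, GROUP BY GroupId and Value; use COUNT(*) to get the Count, and use correlated subqueries to select the ValueId with MIN(RowNumber) (which will always be 1, so you can just use that instead of MIN) and with MAX(RowNumber) to get FirstId and LastId.
Although, now that I've noticed you're using SQL Server 2017, you should be able to use First_Value() and Last_Value() instead of correlated subqueries.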
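A minimal sketch of what this answer describes, assuming GroupId, Value and Created map to the ListId, ValueDelta and ValueCreated columns of the question's table. It uses the window-function form rather than correlated subqueries, so no explicit row-number column is needed. The CTE name is made up for illustration.

```sql
-- Sketch only: one output row per (ListId, ValueDelta) pair.
WITH CTE_FirstLast AS
(
    SELECT
        ListId
        ,ValueDelta
        ,FIRST_VALUE(ValueId) OVER (
            PARTITION BY ListId, ValueDelta ORDER BY ValueCreated
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstId
        ,LAST_VALUE(ValueId) OVER (
            PARTITION BY ListId, ValueDelta ORDER BY ValueCreated
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastId
    FROM [Value]
)
SELECT
    ListId
    ,FirstId
    ,LastId
    ,ValueDelta
    ,COUNT(*) AS ValueCount
FROM CTE_FirstLast
GROUP BY
    ListId
    ,ValueDelta
    ,FirstId
    ,LastId;
```

Because the partition is only by list and value, separated runs of the same value within a list end up in one row. In the question's sample data, IDs 1, 7 and 8 would be reported together, so this is a starting point rather than a full answer to the consecutive-chunk requirement.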
After many iterations, I think I have a working solution. I'm absolutely sure it's far from optimal, but it works.
Link is here: http://sqlfiddle.com/#!18/4ee9f/3
Sample data:
create table [Value]
(
[ValueId] bigint not null identity(1,1),
[ListId] bigint not null,
[ValueDelta] int not null,
[ValueCreated] datetime2 not null,
constraint [PK_Value] primary key clustered ([ValueId])
);
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:01'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:02'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
The query that seems to work:
-- this is the actual order of data
select *
from [Value]
order by [ListId] asc, [ValueCreated] asc;
-- there are 5 sets here
-- set 1 GroupId=1, Id=1&7, Value=1
-- set 2 GroupId=1, Id=2-4, Value=0
-- set 3 GroupId=1, Id=5-6, Value=-1
-- set 4 GroupId=1, Id=8-8, Value=1
-- set 5 GroupId=2, Id=9-9, Value=1
with [cte1] as
(
select [v1].[ListId]
,[v2].[ValueId] as [FirstId], [v2].[ValueCreated] as [FirstCreated]
,[v1].[ValueId] as [LastId], [v1].[ValueCreated] as [LastCreated]
,isnull([v1].[ValueDelta], 0) as [ValueDelta]
from [dbo].[Value] [v1]
join [dbo].[Value] [v2] on [v2].[ListId] = [v1].[ListId]
and isnull([v2].[ValueDelta], 0) = isnull([v1].[ValueDelta], 0) -- the table above has no [ValueDeltaPrev] column; compare [ValueDelta]
and [v2].[ValueCreated] <= [v1].[ValueCreated] and not exists (
select 1
from [dbo].[Value] [v3]
where 1=1
and ([v3].[ListId] = [v1].[ListId])
and ([v3].[ValueCreated] between [v2].[ValueCreated] and [v1].[ValueCreated])
and [v3].[ValueDelta] != [v1].[ValueDelta]
)
), [cte2] as
(
select [t1].*
from [cte1] [t1]
where not exists (select 1 from [cte1] [t2] where [t2].[ListId] = [t1].[ListId]
and ([t1].[FirstId] != [t2].[FirstId] or [t1].[LastId] != [t2].[LastId])
and [t1].[FirstCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
and [t1].[LastCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
)
)
select [ListId], [FirstId], [LastId], [FirstCreated], [LastCreated], [ValueDelta] as [ValueDelta]
,(select count(*) from [dbo].[Value] where [ListId] = [t].[ListId] and [ValueCreated] between [t].[FirstCreated] and [t].[LastCreated]) as [ValueCount]
from [cte2] [t];
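How it works:
- join the table to itself within the same list, but only on older values (or on the same date, to handle single-item sets)
- join the result to itself again and exclude any overlapping ranges, keeping only the largest date ranges
- once the largest sets are identified, count the entries that fall within each set's dates
If anyone can find a better/friendlier solution, you get the answer.
PS: The dumb, straightforward Cursor approach seems a lot faster than this. Still testing.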
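It does look like a gaps-and-islands problem.
Here is one way to do it. It will likely run faster than your variant.
The standard idea for gaps-and-islands is to generate two sets of row numbers, partitioned in two different ways. The difference between these row numbers (rn1-rn2) stays the same within each consecutive chunk. Run the query below CTE-by-CTE and examine the intermediate results to see what is going on.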
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, [ValueDelta] ORDER BY ValueCreated) AS rn2
FROM [Value]
)
SELECT
ListID
,MIN(ValueID) AS FirstID
,MAX(ValueID) AS LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE_RN
GROUP BY
ListID
,ValueDelta
,rn1-rn2
ORDER BY
FirstCreated
;
This query produces the same result as yours on your sample data set.
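To make the rn1-rn2 idea concrete, the intermediate result of CTE_RN can be inspected with the difference exposed as a column. This is only an illustration. The Diff alias and the listed rows assume the original sample data was loaded into an empty table, so that ValueId runs from 1 to 9.

```sql
-- Show both row numbers and their difference for every row.
WITH CTE_RN AS
(
    SELECT
        ValueId
        ,ListId
        ,ValueDelta
        ,ValueCreated
        ,ROW_NUMBER() OVER (PARTITION BY ListId ORDER BY ValueCreated) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY ListId, ValueDelta ORDER BY ValueCreated) AS rn2
    FROM [Value]
)
SELECT ValueId, ListId, ValueDelta, rn1, rn2, rn1 - rn2 AS Diff
FROM CTE_RN
ORDER BY ListId, rn1;

-- Rows for the original sample data:
-- ValueId  ListId  ValueDelta  rn1  rn2  Diff
-- 1        1        1          1    1    0
-- 7        1        1          2    2    0
-- 2        1        0          3    1    2
-- 3        1        0          4    2    2
-- 4        1        0          5    3    2
-- 5        1       -1          6    1    5
-- 6        1       -1          7    2    5
-- 8        1        1          8    3    5
-- 9        2        1          1    1    0
```

Note that ValueId 8 has the same Diff (5) as the chunk of -1 values. Diff alone does not identify a chunk, which is why the GROUP BY includes ValueDelta alongside rn1-rn2.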
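It is not quite clear whether FirstID and LastID can be MIN and MAX, or whether they really must come from the first and last rows (when ordered by ValueCreated). If you really need first and last, the query becomes a bit more complicated.
In your original sample data set, the "first" and the "min" for FirstID are the same. Let's change the sample data set a little to highlight the difference: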
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:02'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:01'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
;
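All I did was swap the ValueCreated between the first and seventh rows, so now the FirstID of the first group is 7 and the LastID is 1. Your query returns the correct result. My simple query above doesn't.
Here is the variant that produces the correct result. I decided to use the FIRST_VALUE and LAST_VALUE functions to get the appropriate IDs. Again, run the query CTE-by-CTE and examine the intermediate results to see what is going on.
This variant produces the same result as your query, even with the adjusted sample data set.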
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, ValueDelta ORDER BY ValueCreated) AS rn2
FROM [Value]
)
,CTE2
AS
(
SELECT
ValueId
,ListId
,ValueDelta
,ValueCreated
,rn1
,rn2
,rn1-rn2 AS Diff
,FIRST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstID
,LAST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastID
FROM CTE_RN
)
SELECT
ListID
,FirstID
,LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE2
GROUP BY
ListID
,ValueDelta
,rn1-rn2
,FirstID
,LastID
ORDER BY FirstCreated;
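For comparison, here is a sketch of another common gaps-and-islands formulation, which is not the one used above: flag each row whose value differs from the previous row in the same list with LAG, then take a running SUM of the flags as a chunk number. The names CTE_Flag, CTE_Chunk, CTE_Pos, IsStart, ChunkNo, rnAsc and rnDesc are made up for illustration, and the sketch assumes ValueCreated has no ties within a list.

```sql
WITH CTE_Flag AS
(
    SELECT
        ValueId, ListId, ValueDelta, ValueCreated
        -- 1 when a new chunk starts; LAG is NULL on the first row, so it falls to ELSE
        ,CASE WHEN LAG(ValueDelta) OVER (PARTITION BY ListId ORDER BY ValueCreated) = ValueDelta
              THEN 0 ELSE 1 END AS IsStart
    FROM [Value]
)
,CTE_Chunk AS
(
    SELECT
        ValueId, ListId, ValueDelta, ValueCreated
        -- running total of chunk starts = chunk number within the list
        ,SUM(IsStart) OVER (PARTITION BY ListId ORDER BY ValueCreated
                            ROWS UNBOUNDED PRECEDING) AS ChunkNo
    FROM CTE_Flag
)
,CTE_Pos AS
(
    SELECT
        ValueId, ListId, ValueDelta, ValueCreated, ChunkNo
        ,ROW_NUMBER() OVER (PARTITION BY ListId, ChunkNo ORDER BY ValueCreated ASC) AS rnAsc
        ,ROW_NUMBER() OVER (PARTITION BY ListId, ChunkNo ORDER BY ValueCreated DESC) AS rnDesc
    FROM CTE_Chunk
)
SELECT
    ListId
    ,MIN(CASE WHEN rnAsc = 1 THEN ValueId END) AS FirstId   -- ID of the earliest row in the chunk
    ,MIN(CASE WHEN rnDesc = 1 THEN ValueId END) AS LastId   -- ID of the latest row in the chunk
    ,MIN(ValueCreated) AS FirstCreated
    ,MAX(ValueCreated) AS LastCreated
    ,ValueDelta
    ,COUNT(*) AS ValueCount
FROM CTE_Pos
GROUP BY
    ListId
    ,ChunkNo
    ,ValueDelta
ORDER BY
    ListId
    ,FirstCreated;
```

Whether this performs better or worse than the rn1-rn2 variant depends on the data and indexes, so it is worth comparing the execution plans of both on real data.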