Merge groups of consecutive rows in T-SQL and sum values from each group
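Update, 8 October 2019:
@Gordon Linoff: I tried to apply your solution, but I realized it does not work as expected. I have added an example with the expected result, with comments, here (https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=1b486476d6aeab25997f25e66ee455e9). I would appreciate it if you could help.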
--
I have a Transactions table with the following schema:
CREATE TABLE Transactions (Id int IDENTITY, SessionId int, TransactionType varchar(50), DateTimeEnd datetime, DateStart datetime, Rank int);
Here are some sample rows:
INSERT INTO Transactions (Id, SessionId, TransactionType, DateTimeEnd, DateStart, Rank)
VALUES
(1, 1, 'Deposit', '2017-01-20T11:16:33Z', '2017-01-20T11:16:33Z', 600),
(2, 1, 'Withdrawal', '2017-01-21T11:16:33Z', '2017-01-20T11:16:33Z', 100),
(3, 2, 'Deposit', '2017-02-23T11:16:33Z', '2017-02-23T11:16:33Z', 500),
(4, 1, 'Withdrawal', '2017-01-24T11:16:33Z', '2017-01-21T11:16:33Z', 150),
(5, 1, 'Withdrawal', '2017-01-26T11:16:33Z', '2017-01-24T11:16:33Z', 150),
(6, 2, 'Withdrawal', '2017-02-27T11:16:33Z', '2017-02-23T11:16:33Z', 200),
(7, 1, 'Withdrawal', '2017-01-28T11:16:33Z', '2017-01-26T11:16:33Z', 10),
(8, 1, 'Withdrawal', '2017-01-30T11:16:33Z', '2017-01-28T11:16:33Z', 10),
(9, 1, 'Withdrawal', '2017-01-31T11:16:33Z', '2017-01-30T11:16:33Z', 10);
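What I want is a T-SQL query that merges groups of consecutive rows by SessionId and TransactionType, keeping from each group only the row with the minimum DateTimeEnd. In addition, the Rank value of the kept row must be the sum of the Rank values of all rows in the group. The T-SQL query needs to run in MS SQL Server in Microsoft Azure SQL Data Warehouse.
Expected result: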
| Id | SessionId | TransactionType | DateTimeEnd          | DateStart            | Rank |
|----|-----------|-----------------|----------------------|----------------------|------|
| 1  | 1         | Deposit         | 2017-01-20T11:16:33Z | 2017-01-20T11:16:33Z | 600  |
| 2  | 1         | Withdrawal      | 2017-01-21T11:16:33Z | 2017-01-20T11:16:33Z | 100  |
| 4  | 1         | Withdrawal      | 2017-01-24T11:16:33Z | 2017-01-21T11:16:33Z | 300  |
| 7  | 1         | Withdrawal      | 2017-01-28T11:16:33Z | 2017-01-26T11:16:33Z | 30   |
| 3  | 2         | Deposit         | 2017-02-23T11:16:33Z | 2017-02-23T11:16:33Z | 500  |
| 6  | 2         | Withdrawal      | 2017-02-27T11:16:33Z | 2017-02-23T11:16:33Z | 200  |
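I have tried many approaches but could not get this to work.
This is a variant of the gaps-and-islands problem.
I would approach it as follows:
1) First, identify and merge the groups of records. The following query gives you, for each group, the minimum DateTimeEnd and the sum of the ranks: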
SELECT
    SessionId,
    TransactionType,
    SUM(Rank) SumRank,
    MIN(DateTimeEnd) MinDateTimeEnd
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER(ORDER BY DateTimeEnd) rn1,
        ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY DateTimeEnd) rn2
    FROM Transactions t
) x
GROUP BY SessionId, TransactionType, rn1 - rn2
Returns:
SessionId | TransactionType | SumRank | MinDateTimeEnd
--------: | :-------------- | ------: | :------------------
1 | Deposit | 600 | 20/01/2017 11:16:33
1 | Withdrawal | 430 | 21/01/2017 11:16:33
2 | Deposit | 500 | 23/02/2017 11:16:33
2 | Withdrawal | 200 | 27/02/2017 11:16:33
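To see why rn1 - rn2 works as a grouping key, it can help to look at the two row numbers side by side. The query below is only an illustrative sketch against the sample rows from the question; the alias island_key is a name chosen here and is not part of the original query.

```sql
-- Sketch: expose the two row numbers and their difference for every row.
-- island_key is a hypothetical alias used only for this illustration.
SELECT
    x.Id,
    x.SessionId,
    x.TransactionType,
    x.DateTimeEnd,
    x.rn1,                       -- position among all rows, by DateTimeEnd
    x.rn2,                       -- position within (SessionId, TransactionType)
    x.rn1 - x.rn2 AS island_key  -- constant while both counters advance together
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER(ORDER BY DateTimeEnd) AS rn1,
        ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY DateTimeEnd) AS rn2
    FROM Transactions t
) x
ORDER BY x.DateTimeEnd;
```

On the nine sample rows, the six Withdrawal rows of session 1 (ids 2, 4, 5, 7, 8, 9) have rn1 running from 2 to 7 and rn2 running from 1 to 6, so island_key is 1 for all of them. No other row falls between them when ordering by DateTimeEnd, which is why they collapse into a single group with SumRank 430.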
2) Then, join the result of the query above back to the original table to retrieve the remaining columns:
SELECT
    t.id,
    t.SessionId,
    t.TransactionType,
    t.DateTimeEnd,
    t.DateStart,
    x.SumRank
FROM Transactions t
INNER JOIN (
    SELECT
        SessionId,
        TransactionType,
        SUM(Rank) SumRank,
        MIN(DateTimeEnd) MinDateTimeEnd
    FROM (
        SELECT
            t.*,
            ROW_NUMBER() OVER(ORDER BY DateTimeEnd) rn1,
            ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY DateTimeEnd) rn2
        FROM Transactions t
    ) x
    GROUP BY SessionId, TransactionType, rn1 - rn2
) x
    ON x.SessionId = t.SessionId
    AND x.TransactionType = t.TransactionType
    AND x.MinDateTimeEnd = t.DateTimeEnd
Yields:
id | SessionId | TransactionType | DateTimeEnd | DateStart | SumRank
-: | --------: | :-------------- | :------------------ | :------------------ | ------:
1 | 1 | Deposit | 20/01/2017 11:16:33 | 20/01/2017 11:16:33 | 600
2 | 1 | Withdrawal | 21/01/2017 11:16:33 | 20/01/2017 11:16:33 | 430
3 | 2 | Deposit | 23/02/2017 11:16:33 | 23/02/2017 11:16:33 | 500
6 | 2 | Withdrawal | 27/02/2017 11:16:33 | 23/02/2017 11:16:33 | 200
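Note: as mentioned in the comments, I think there is a problem with the expected result you show. The rows with ids 4 and 7 should not appear in the output, because the row with id 2 has the same SessionId and TransactionType and an earlier DateTimeEnd.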
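The expected result in the question does appear to be reproducible if "consecutive" is taken to mean adjacent in overall Id order, across sessions, rather than adjacent in DateTimeEnd order. That reading is an assumption; the question does not state it. Under that assumption, a minimal sketch orders both row numbers by Id and joins back on the first Id of each group. The aliases FirstId and g are names chosen here, and the join assumes the lowest Id in a group is also the row with the minimum DateTimeEnd, which holds for the sample rows.

```sql
-- Sketch, assuming islands are defined by overall Id order (not DateTimeEnd).
-- FirstId and g are hypothetical aliases introduced for this example.
SELECT
    t.Id,
    t.SessionId,
    t.TransactionType,
    t.DateTimeEnd,
    t.DateStart,
    g.SumRank
FROM Transactions t
INNER JOIN (
    SELECT
        SessionId,
        TransactionType,
        MIN(Id) AS FirstId,      -- first row of the island
        SUM(Rank) AS SumRank     -- total over the island
    FROM (
        SELECT
            t.*,
            ROW_NUMBER() OVER(ORDER BY Id) AS rn1,
            ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY Id) AS rn2
        FROM Transactions t
    ) x
    GROUP BY SessionId, TransactionType, rn1 - rn2
) g
    ON g.FirstId = t.Id
ORDER BY t.SessionId, t.Id;
```

With the sample data, rows 3 and 6 (session 2) now sit between the session 1 withdrawals, so the difference rn1 - rn2 is 1 for id 2, 2 for ids 4 and 5, and 3 for ids 7, 8 and 9. The query should therefore return ids 1, 2, 4, 7, 3, 6 with SumRank 600, 100, 300, 30, 500, 200.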
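As GMB points out, this is a gaps-and-islands problem. Because you want to keep the first row, I would suggest a lag() approach rather than the difference of row numbers: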
SELECT SessionId, TransactionType, DateTimeEnd,DateStart, sumRank
FROM (SELECT t.*,
SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) as sumRank
FROM (SELECT t.*,
SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY id) as grp
FROM (SELECT t.*,
LAG(id) OVER (PARTITION BY SessionId, TransactionType ORDER BY id) as prev_st_id,
LAG(id) OVER (PARTITION BY SessionId ORDER BY id) as prev_id
FROM Transactions t
) t
) t
) t
WHERE prev_st_id <> prev_id OR prev_st_id IS NULL;
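What is this doing?
- The innermost subquery calculates the lag of the id overall and for the session/transaction type. This uses id because it seems more stable than the date/times (there are duplicate date/time values in one of the columns).
- When the ids are different, a new island is identified. The cumulative sum identifies the group.
- Then grp is used with a window function to calculate the value over the entire group.
- The outer query then filters down to the first row in each group.
Here is a db<>fiddle.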
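Regarding the update in the question: in the query above, prev_id is partitioned by SessionId, so the session 2 rows (ids 3 and 6) do not interrupt the run of session 1 withdrawals, and ids 2, 4, 5, 7, 8 and 9 end up in one island. If the intended meaning of "consecutive" is adjacency in overall Id order (an assumption inferred from the expected result, not confirmed by either answer), one possible adjustment is to compute prev_id over the whole table. The following is a sketch under that assumption; it also adds Id to the select list and an ORDER BY for readability.

```sql
-- Sketch: same lag() technique, but prev_id is the previous Id in the whole table,
-- so any row of another session or type starts a new island.
SELECT Id, SessionId, TransactionType, DateTimeEnd, DateStart, sumRank
FROM (SELECT t.*,
             SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) AS sumRank
      FROM (SELECT t.*,
                   -- running count of island starts; NULL = NULL is not true, so it counts as a start
                   SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY Id) AS grp
            FROM (SELECT t.*,
                         LAG(Id) OVER (PARTITION BY SessionId, TransactionType ORDER BY Id) AS prev_st_id,
                         LAG(Id) OVER (ORDER BY Id) AS prev_id
                  FROM Transactions t
                 ) t
           ) t
     ) t
WHERE prev_st_id <> prev_id OR prev_st_id IS NULL
ORDER BY SessionId, Id;
```

On the sample rows, islands start at ids 1, 2, 3, 4, 6 and 7, so the kept rows should carry sumRank 600, 100, 500, 300, 200 and 30 respectively, matching the expected result in the question.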