Merge groups of consecutive rows in T-SQL and sum values from each group

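Update, October 8, 2019:

@Gordon Linoff: I tried to apply your solution, but I realized it does not work as expected. I have added an example with the expected result and comments here (https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=1b486476d6aeab25997f25e66ee455e9). I would appreciate it if you could help me.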

--

I have a transactions table with the following schema:

CREATE TABLE Transactions (Id int IDENTITY, SessionId int, TransactionType varchar(50), DateTimeEnd datetime, DateStart datetime, Rank int);

Here are some sample rows:

INSERT INTO Transactions (Id, SessionId, TransactionType, DateTimeEnd, DateStart, Rank)
VALUES
 (1, 1, 'Deposit',    '2017-01-20T11:16:33Z', '2017-01-20T11:16:33Z', 600),
 (2, 1, 'Withdrawal', '2017-01-21T11:16:33Z', '2017-01-20T11:16:33Z', 100),
 (3, 2, 'Deposit',    '2017-02-23T11:16:33Z', '2017-02-23T11:16:33Z', 500),
 (4, 1, 'Withdrawal', '2017-01-24T11:16:33Z', '2017-01-21T11:16:33Z', 150),
 (5, 1, 'Withdrawal', '2017-01-26T11:16:33Z', '2017-01-24T11:16:33Z', 150),
 (6, 2, 'Withdrawal', '2017-02-27T11:16:33Z', '2017-02-23T11:16:33Z', 200),
 (7, 1, 'Withdrawal', '2017-01-28T11:16:33Z', '2017-01-26T11:16:33Z', 10),
 (8, 1, 'Withdrawal', '2017-01-30T11:16:33Z', '2017-01-28T11:16:33Z', 10),
 (9, 1, 'Withdrawal', '2017-01-31T11:16:33Z', '2017-01-30T11:16:33Z', 10);

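What I want is a T-SQL query that merges groups of consecutive rows per SessionId and TransactionType and, for each group, keeps only the row with the minimum DateTimeEnd. In addition, the Rank value of the row that is kept must be the sum of the Rank values of the rows in the group. The T-SQL query needs to run on MS SQL Server in Microsoft Azure SQL Data Warehouse.

Expected result: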

| Id | SessionId | TransactionType | DateTimeEnd          | DateStart            | Rank |
|----|-----------|-----------------|----------------------|----------------------|------|
| 1  | 1         | Deposit         | 2017-01-20T11:16:33Z | 2017-01-20T11:16:33Z | 600  |
| 2  | 1         | Withdrawal      | 2017-01-21T11:16:33Z | 2017-01-20T11:16:33Z | 100  |
| 4  | 1         | Withdrawal      | 2017-01-24T11:16:33Z | 2017-01-21T11:16:33Z | 300  |
| 7  | 1         | Withdrawal      | 2017-01-28T11:16:33Z | 2017-01-26T11:16:33Z | 30   |
| 3  | 2         | Deposit         | 2017-02-23T11:16:33Z | 2017-02-23T11:16:33Z | 500  |
| 6  | 2         | Withdrawal      | 2017-02-27T11:16:33Z | 2017-02-23T11:16:33Z | 200  |

I have tried many approaches but could not get this to work.

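This is a gaps-and-islands variant.

I would approach it as follows:

1) First, identify and merge the groups of records. The following query gives you, for each group, the minimum DateTimeEnd and the sum of Rank: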

SELECT 
    SessionId, 
    TransactionType, 
    SUM(Rank) SumRank, 
    MIN(DateTimeEnd) MinDateTimeEnd
FROM (
    SELECT 
        t.*,
        ROW_NUMBER() OVER(ORDER BY DateTimeEnd) rn1,
        ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY DateTimeEnd) rn2
    FROM Transactions t
 ) x
GROUP BY SessionId, TransactionType, rn1 - rn2

Returns:

SessionId | TransactionType | SumRank | MinDateTimeEnd     
--------: | :-------------- | ------: | :------------------
        1 | Deposit         |     600 | 20/01/2017 11:16:33
        1 | Withdrawal      |     430 | 21/01/2017 11:16:33
        2 | Deposit         |     500 | 23/02/2017 11:16:33
        2 | Withdrawal      |     200 | 27/02/2017 11:16:33
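
To see why `rn1 - rn2` identifies an island, the same logic can be traced outside the database. The following is only a minimal Python sketch, not T-SQL: the `ROWS` list copies the sample data from the question, and the function name `islands_by_row_number_difference` is made up for this illustration. Within a run of consecutive rows of the same SessionId and TransactionType, both row numbers advance by one per row, so their difference stays constant.

```python
from collections import defaultdict

# (Id, SessionId, TransactionType, DateTimeEnd, Rank), copied from the question's sample rows
ROWS = [
    (1, 1, "Deposit",    "2017-01-20T11:16:33Z", 600),
    (2, 1, "Withdrawal", "2017-01-21T11:16:33Z", 100),
    (3, 2, "Deposit",    "2017-02-23T11:16:33Z", 500),
    (4, 1, "Withdrawal", "2017-01-24T11:16:33Z", 150),
    (5, 1, "Withdrawal", "2017-01-26T11:16:33Z", 150),
    (6, 2, "Withdrawal", "2017-02-27T11:16:33Z", 200),
    (7, 1, "Withdrawal", "2017-01-28T11:16:33Z", 10),
    (8, 1, "Withdrawal", "2017-01-30T11:16:33Z", 10),
    (9, 1, "Withdrawal", "2017-01-31T11:16:33Z", 10),
]


def islands_by_row_number_difference(rows):
    """Hypothetical helper mirroring GROUP BY SessionId, TransactionType, rn1 - rn2."""
    # ORDER BY DateTimeEnd (ISO-8601 strings in one time zone sort chronologically)
    ordered = sorted(rows, key=lambda r: r[3])
    seen_in_partition = defaultdict(int)
    groups = {}
    for rn1, (_id, session, ttype, end, rank) in enumerate(ordered, start=1):
        seen_in_partition[(session, ttype)] += 1
        rn2 = seen_in_partition[(session, ttype)]  # row number within the partition
        key = (session, ttype, rn1 - rn2)          # constant inside one island
        total, min_end = groups.get(key, (0, end))
        groups[key] = (total + rank, min(min_end, end))
    return sorted((s, t, total, min_end) for (s, t, _), (total, min_end) in groups.items())


result = islands_by_row_number_difference(ROWS)
for row in result:
    print(row)
```

On the sample data this prints four groups, and the SessionId 1 withdrawals sum to 430, as in the query result above.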

2) Then join the result of the query above with the original table to retrieve the remaining columns:

SELECT 
    t.id,
    t.SessionId,
    t.TransactionType,
    t.DateTimeEnd,
    t.DateStart,
    x.SumRank
FROM Transactions t
INNER JOIN (
    SELECT 
        SessionId, 
        TransactionType, 
        SUM(Rank) SumRank, 
        MIN(DateTimeEnd) MinDateTimeEnd
    FROM (
        SELECT 
            t.*,
            ROW_NUMBER() OVER(ORDER BY DateTimeEnd) rn1,
            ROW_NUMBER() OVER(PARTITION BY SessionId, TransactionType ORDER BY DateTimeEnd) rn2
        FROM Transactions t
    ) x
    GROUP BY SessionId, TransactionType, rn1 - rn2
) x 
    ON  x.SessionId = t.SessionId
    AND x.TransactionType = t.TransactionType
    AND x.MinDateTimeEnd = t.DateTimeEnd

Yields:

id | SessionId | TransactionType | DateTimeEnd         | DateStart           | SumRank
-: | --------: | :-------------- | :------------------ | :------------------ | ------:
 1 |         1 | Deposit         | 20/01/2017 11:16:33 | 20/01/2017 11:16:33 |     600
 2 |         1 | Withdrawal      | 21/01/2017 11:16:33 | 20/01/2017 11:16:33 |     430
 3 |         2 | Deposit         | 23/02/2017 11:16:33 | 23/02/2017 11:16:33 |     500
 6 |         2 | Withdrawal      | 27/02/2017 11:16:33 | 23/02/2017 11:16:33 |     200

Demo on DB Fiddle

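Note: as mentioned in the comments, I believe there is a problem with the expected result you show. The rows with ids 4 and 7 should not appear in the output, because the row with id 2 has the same SessionId and TransactionType and an earlier DateTimeEnd.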
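
Whether rows 4 and 7 start their own groups depends on which column defines "consecutive". The following is only a minimal Python sketch, not T-SQL, under the assumption that the sample rows from the question are the whole table; the helper name `merge_consecutive` is made up for this illustration. It merges runs of adjacent rows with the same SessionId and TransactionType, once ordered by DateTimeEnd (as in the queries above) and once ordered by Id.

```python
from itertools import groupby

# (Id, SessionId, TransactionType, DateTimeEnd, Rank), copied from the question's sample rows
ROWS = [
    (1, 1, "Deposit",    "2017-01-20T11:16:33Z", 600),
    (2, 1, "Withdrawal", "2017-01-21T11:16:33Z", 100),
    (3, 2, "Deposit",    "2017-02-23T11:16:33Z", 500),
    (4, 1, "Withdrawal", "2017-01-24T11:16:33Z", 150),
    (5, 1, "Withdrawal", "2017-01-26T11:16:33Z", 150),
    (6, 2, "Withdrawal", "2017-02-27T11:16:33Z", 200),
    (7, 1, "Withdrawal", "2017-01-28T11:16:33Z", 10),
    (8, 1, "Withdrawal", "2017-01-30T11:16:33Z", 10),
    (9, 1, "Withdrawal", "2017-01-31T11:16:33Z", 10),
]


def merge_consecutive(rows, order_key):
    """Hypothetical helper: merge adjacent rows sharing (SessionId, TransactionType)."""
    ordered = sorted(rows, key=order_key)
    merged = []
    # groupby only groups *adjacent* equal keys, which is exactly the island notion
    for (session, ttype), run in groupby(ordered, key=lambda r: (r[1], r[2])):
        run = list(run)
        first = min(run, key=lambda r: r[3])  # row with the minimum DateTimeEnd
        merged.append((first[0], session, ttype, sum(r[4] for r in run)))
    return merged


by_end = merge_consecutive(ROWS, order_key=lambda r: r[3])  # ordered by DateTimeEnd
by_id = merge_consecutive(ROWS, order_key=lambda r: r[0])   # ordered by Id

print([r[0] for r in by_end])
print([r[0] for r in by_id])
```

Ordered by DateTimeEnd, all SessionId 1 withdrawals are adjacent and collapse into the row with id 2. Ordered by Id, rows 3 and 6 from SessionId 2 fall in between, so rows 4 and 7 begin new groups.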

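As GMB pointed out, this is a gaps-and-islands problem. Because you want to keep the first row, I would suggest a lag() approach rather than a difference of row numbers: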

SELECT SessionId, TransactionType, DateTimeEnd,DateStart, sumRank
FROM (SELECT t.*,
             SUM(Rank) OVER (PARTITION BY SessionId, TransactionType, grp) as sumRank
      FROM (SELECT t.*,
                   SUM(CASE WHEN prev_st_id = prev_id THEN 0 ELSE 1 END) OVER (ORDER BY id) as grp
            FROM (SELECT t.*,
                         LAG(id) OVER (PARTITION BY SessionId, TransactionType ORDER BY id) as prev_st_id,
                         LAG(id) OVER (ORDER BY id) as prev_id  -- previous row overall, so a row from another session also breaks an island
                  FROM Transactions t
                 ) t
           ) t
     ) t
WHERE prev_st_id <> prev_id OR prev_st_id IS NULL;

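What does this do?

  • The innermost subquery computes the lag of id overall and within the session/transaction type. It uses id because it looks more stable than the date/times (one of the columns has duplicate date/time values).
  • When the two lagged ids differ, a new island starts. The cumulative sum identifies the groups.
  • Then grp is used with a window function to compute the value (sumRank) for the whole group.
  • The outer query then filters down to just the first row of each group.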
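
The steps above can be traced outside the database. The following is only a minimal Python sketch, not T-SQL: it assumes the overall lag is taken across all rows ordered by Id, with no partition, and the function name `islands_by_lag` is made up for this illustration. The `ROWS` list copies the sample data from the question.

```python
# (Id, SessionId, TransactionType, DateTimeEnd, Rank), copied from the question's sample rows
ROWS = [
    (1, 1, "Deposit",    "2017-01-20T11:16:33Z", 600),
    (2, 1, "Withdrawal", "2017-01-21T11:16:33Z", 100),
    (3, 2, "Deposit",    "2017-02-23T11:16:33Z", 500),
    (4, 1, "Withdrawal", "2017-01-24T11:16:33Z", 150),
    (5, 1, "Withdrawal", "2017-01-26T11:16:33Z", 150),
    (6, 2, "Withdrawal", "2017-02-27T11:16:33Z", 200),
    (7, 1, "Withdrawal", "2017-01-28T11:16:33Z", 10),
    (8, 1, "Withdrawal", "2017-01-30T11:16:33Z", 10),
    (9, 1, "Withdrawal", "2017-01-31T11:16:33Z", 10),
]


def islands_by_lag(rows):
    """Hypothetical helper mirroring the lag / running-sum / filter steps."""
    ordered = sorted(rows, key=lambda r: r[0])  # ORDER BY Id
    last_id_in_partition = {}  # source of prev_st_id (lag within session/type)
    prev_id = None             # lag over all rows (assumption: no partition)
    grp = 0                    # running sum of "new island" flags
    first_rows = []            # the first row of each island
    sum_rank = {}              # SUM(Rank) per grp
    for id_, session, ttype, end, rank in ordered:
        prev_st_id = last_id_in_partition.get((session, ttype))
        # Same test as the WHERE clause: a NULL or differing lag starts an island
        if prev_st_id is None or prev_st_id != prev_id:
            grp += 1
            first_rows.append((grp, id_, session, ttype, end))
        sum_rank[grp] = sum_rank.get(grp, 0) + rank
        last_id_in_partition[(session, ttype)] = id_
        prev_id = id_
    return [(id_, s, t, end, sum_rank[g]) for g, id_, s, t, end in first_rows]


result = islands_by_lag(ROWS)
for row in result:
    print(row)
```

Under that assumption the sketch keeps the rows with ids 1, 2, 3, 4, 6 and 7, with summed ranks 600, 100, 500, 300, 200 and 30.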

Here is a db<>fiddle.