T-SQL:统计第一次成功前的失败次数(2)

T-SQL: Count number of failures before first success (2)

我有一个数据库,其中包含任务尝试及其结果(失败或成功)的事件。对于每个用户,我想计算第一次成功之前的失败次数。随后的失败和成功不应影响输出——我只对给定任务的第一次成功感兴趣。此外,数据库中包含其他应忽略的事件的行。

我如何在 Vertica 数据库的 T-SQL 中表达这个?

(我最终想计算每个任务的平均尝试次数,但让我们将其排除在这个问题的范围之外,以使事情易于管理。)

这是对此处问题的更新:

在最初的问题中,我给出的样本数据结构不佳,无法完全反映我的使用场景,导致答案不适用于我的实际数据,我无法验证。

解决方案不应依赖于行顺序 - 行可能未按时间戳顺序填充。

这是数据库设置:

CREATE TABLE events {
      eventID int -- unused in this example, should be excluded from output
    , eventName varchar(256)
    , userName varchar(256)
    , timestamp timestamp
    , taskName varchar(256)
    , sessionID int -- unused in this example, should be excluded from output
};

INSERT INTO events
    VALUES
        (2363460186192576512, 'beginSession', 'John', '2017-08-14 09:46:46.712', NULL, 145031357)
      , (2363460852537008128, 'success', 'John', '2017-08-14 09:49:32.471', 'TaskOne', 145031357)
      , (2363461162974437376, 'success', 'John', '2017-08-14 09:50:48.781', 'TaskOne', 145031357)
      , (2363460390131740672, 'fail', 'John', '2017-08-14 09:47:37.349', 'TaskOne', 145031357)
      , (2363460556662710272, 'fail', 'John', '2017-08-14 09:48:23.024', 'TaskOne', 145031357)
      , (2363460730671505408, 'fail', 'John', '2017-08-14 09:48:58.646', 'TaskOne', 145031357)
      , (2363461032111800320, 'fail', 'John', '2017-08-14 09:50:10.726', 'TaskOne', 145031357)
      , (2363460389896859648, 'beginTask', 'John', '2017-08-14 09:47:05.32', 'TaskOne', 145031357)
      , (2363460463137751040, 'beginTask', 'John', '2017-08-14 09:47:52.166', 'TaskOne', 145031357)
      , (2363460556205531136, 'beginTask', 'John', '2017-08-14 09:48:12.615', 'TaskOne', 145031357)
      , (2363460692671205376, 'beginTask', 'John', '2017-08-14 09:48:36.155', 'TaskOne', 145031357)
      , (2363460852268572672, 'beginTask', 'John', '2017-08-14 09:49:12.047', 'TaskOne', 145031357)
      , (2363460962524327936, 'beginTask', 'John', '2017-08-14 09:49:47.951', 'TaskOne', 145031357)
      , (2363461162714390528, 'beginTask', 'John', '2017-08-14 09:50:23.645', 'TaskOne', 145031357)
      , (2363474741421064192, 'beginSession', 'John', '2017-08-14 10:44:36.042', NULL, 145031392)
      , (2363474885491200000, 'success', 'John', '2017-08-14 10:45:14.577', 'TaskTwo', 145031392)
      , (2363475342389641216, 'success', 'John', '2017-08-14 10:47:04.098', 'TaskTwo', 145031392)
      , (2363475473998635008, 'success', 'John', '2017-08-14 10:47:34.135', 'TaskOne', 145031392)
      , (2363475822079254528, 'success', 'John', '2017-08-14 10:48:53.381', 'TaskTwo', 145031392)
      , (2363476096949104640, 'success', 'John', '2017-08-14 10:50:07.441', 'TaskThree', 145031392)
      , (2363475066098266112, 'fail', 'John', '2017-08-14 10:45:53.526', 'TaskTwo', 145031392)
      , (2363475195152531456, 'fail', 'John', '2017-08-14 10:46:32.81', 'TaskTwo', 145031392)
      , (2363475654638821376, 'fail', 'John', '2017-08-14 10:48:13.71', 'TaskThree', 145031392)
      , (2363476247751114752, 'beginSession', 'Mike', '2017-08-14 10:50:37.67', NULL, 145030476)
      , (2363476335819063296, 'success', 'Mike', '2017-08-14 10:51:06.841', 'TaskOne', 145030476)
      , (2363476485643796480, 'success', 'Mike', '2017-08-14 10:51:41.086', 'TaskTwo', 145030476)
      , (2363476806063038464, 'success', 'Mike', '2017-08-14 10:52:53.174', 'TaskTwo', 145030476)
      , (2363477266119335936, 'success', 'Mike', '2017-08-14 10:54:32.053', 'TaskThree', 145030476)
      , (2363477619191631872, 'success', 'Mike', '2017-08-14 10:56:01.783', 'TaskThree', 145030476)
      , (2363476705131655168, 'fail', 'Mike', '2017-08-14 10:52:21.312', 'TaskThree', 145030476)
      , (2363476939634896896, 'fail', 'Mike', '2017-08-14 10:53:28.906', 'TaskThree', 145030476)
      , (2363477390937976832, 'fail', 'Mike', '2017-08-14 10:55:05.499', 'TaskThree', 145030476)
      , (2363476335592570880, 'beginTask', 'Mike', '2017-08-14 10:50:50.074', 'TaskOne', 145030476)
      , (2363476485501190144, 'beginTask', 'Mike', '2017-08-14 10:51:20.784', 'TaskTwo', 145030476)
      , (2363476704779333632, 'beginTask', 'Mike', '2017-08-14 10:51:54.829', 'TaskThree', 145030476)
      , (2363476805752659968, 'beginTask', 'Mike', '2017-08-14 10:52:34.001', 'TaskTwo', 145030476)
      , (2363476939496484864, 'beginTask', 'Mike', '2017-08-14 10:53:06.468', 'TaskThree', 145030476)
      , (2363477265938980864, 'beginTask', 'Mike', '2017-08-14 10:53:45.631', 'TaskThree', 145030476)
      , (2363477390635986944, 'beginTask', 'Mike', '2017-08-14 10:54:44.706', 'TaskThree', 145030476)
      , (2363477573427560448, 'beginTask', 'Mike', '2017-08-14 10:55:17.231', 'TaskThree', 145030476)
      , (2363474885214375936, 'beginTask', 'John', '2017-08-14 10:44:44.702', 'TaskTwo', 145031392)
      , (2363474985177161728, 'beginTask', 'John', '2017-08-14 10:45:31.133', 'TaskTwo', 145031392)
      , (2363475195014119424, 'beginTask', 'John', '2017-08-14 10:46:10.098', 'TaskTwo', 145031392)
      , (2363475342184120320, 'beginTask', 'John', '2017-08-14 10:46:45.357', 'TaskTwo', 145031392)
      , (2363475473616953344, 'beginTask', 'John', '2017-08-14 10:47:17.911', 'TaskOne', 145031392)
      , (2363475654437494784, 'beginTask', 'John', '2017-08-14 10:47:47.681', 'TaskThree', 145031392)
      , (2363475771776864256, 'beginTask', 'John', '2017-08-14 10:48:27.1', 'TaskTwo', 145031392)
      , (2363476006456762368, 'beginTask', 'John', '2017-08-14 10:49:06.151', 'TaskThree', 145031392)
    ;

根据这些数据,这是我要实现的结果:

userName  taskName   numFailuresBeforeFirstSuccess
John      TaskOne    3
John      TaskTwo    0
John      TaskThree  1
Mike      TaskOne    0
Mike      TaskTwo    0
Mike      TaskThree  3

这是一种方法:

select e.username, e.taskname,
       sum(case when timestamp < first_success_ts and e.eventname = 'fail' then 1 else 0 end) as numFailuresBeforeSuccess
from (select e.*,
             min(case when e.eventname = 'success' then e.timestamp end) over
                (partition by e.username, e.taskname) as first_success_ts
      from events e
     ) e
group by e.username, e.taskname
order by e.username, e.taskname;

这使用 window 函数计算首次成功时间。这应该适用于两个数据库(至少在 SQL Server 2012+)

这个查询:

 with F as
 (
    select *  from Evts where eventName = 'fail'
 ),

 S as
 (
    select * from Evts E 
        cross apply
        (
            select count(F.eventID) numFailuresBeforeFirstSuccess from F 
                where F.userName = E.userName and 
                      E.taskName = F.taskName and 
                      F.timestamp < E.timestamp
        ) K

    where eventName = 'success'
 )

 select userName, taskName, numFailuresBeforeFirstSuccess from      
    (select *, row_number() over (partition by userName, taskName order by [timestamp] desc) o from S ) S 
       where o = 1

产生这个结果:

userName    taskName    numFailuresBeforeFirstSuccess
----------- ----------- -----------------------------
John        TaskOne     4
John        TaskThree   1
John        TaskTwo     2
Mike        TaskOne     0
Mike        TaskThree   3
Mike        TaskTwo     0

适用于此。

Rextester Demo

再次声明,这是 TSQL 而不是 Vertica,但它是相当标准的 SQL,只要 Vertica 支持 CTE。

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'Success'
)
SELECT s1.userName, s1.taskName, AVG(s1.failCount) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn,  COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts  
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND  t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.userName, s1.taskName
ORDER BY s1.userName, s1.taskName

这给出了您的平均值。删除外部查询以获取我正在使用的数据。它产生的结果与您列出的结果略有不同,但应该给出您所说的您正在寻找的适当平均值。如果我误解了要求,请告诉我。

注意:在我的测试数据中,我还添加了两个失败但没有成功的人,只是为了验证他们是否被排除在结果之外。

, (2363476006456762398, 'fail', 'Steve', '2017-08-14 11:29:06.151', 'Task42', 145031342)
, (2363476046456762368, 'fail', 'Joe', '2017-08-14 11:49:06.151', 'Task42', 145031399)

=====================================

结果

-----------------------------------
|userName|  taskName |avgFailCount|
-----------------------------------
|  John  | TaskOne   |     1      |
|  John  | TaskThree |     1      |
|  John  | TaskTwo   |     0      |
|  Mike  | TaskOne   |     0      |
|  Mike  | TaskThree |     1      |
|  Mike  | TaskTwo   |     0      |
-----------------------------------

============================================= ===========================

编辑:对于仅任务的平均值:

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'Success'
)
SELECT s1.taskName
    , AVG(CAST(s1.failCount AS decimal(5,2))) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn,  COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts  
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND  t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.taskName
ORDER BY s1.taskName

这给了你

--------------------------
| taskName  |avgFailCount|
--------------------------
| TaskOne   | 1.000000   |
| TaskThree | 1.333333   |
| TaskTwo   | 0.400000   |
--------------------------

本质上是

SELECT (3+1+0+0)/4.0  AS TaskOne
SELECT (0+2+0+0+0)/5.0 AS TaskTwo
SELECT (1+2+1)/3.0 AS TaskThree

来自以下数据点。

--------------------------------
|userName|  taskName |FailCount|
--------------------------------
|  John  | TaskOne   |    3    |
|  John  | TaskOne   |    1    |
|  John  | TaskOne   |    0    |
|  Mike  | TaskOne   |    0    |
|  John  | TaskTwo   |    0    |
|  John  | TaskTwo   |    2    |
|  John  | TaskTwo   |    0    |
|  Mike  | TaskTwo   |    0    |
|  Mike  | TaskTwo   |    0    |
|  John  | TaskThree |    1    |
|  Mike  | TaskThree |    2    |
|  Mike  | TaskThree |    1    |
--------------------------------

这是成功前的平均失败次数,而不是每次尝试失败的平均次数。那会略有不同。

---------------------------------------------------
| task | fails | attempts | avg fails per attempt |
---------------------------------------------------
| Task1|   4   |    8     | 4/8 = 0.500000        |
| Task2|   2   |    7     | 2/7 = 0.285714        |
| Task3|   3   |    7     | 3/7 = 0.428571        |
---------------------------------------------------