Output number of non-consecutive failures from historical data in BigQuery

This is related to my previous scenario.

I have a dataset like this:

WITH failure_table AS
  (SELECT 'Andrea' AS name, 'Failure' AS status, '2022-04-28 4:00:00' AS timestamp
   UNION ALL SELECT 'Karl', 'Failure', '2022-04-28 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-27 4:00:00'
   UNION ALL SELECT 'Karl', 'Failure', '2022-04-27 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-26 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-25 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-30 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-29 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-28 4:00:00'
   UNION ALL SELECT 'Karl', 'Failure', '2022-03-28 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-15 4:00:00')

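Besides outputting the timestamp at which each user began their current streak of failures (a failure status submitted every consecutive day up to today, 2022-04-29), I also want to output the number of days Karl or Andrea failed across non-consecutive blocks, along with the number of such blocks.

In this case, Andrea most recently started failing at 2022-04-25 4:00:00 and has submitted 3 failure blocks (03-15, 03-28 to 03-30, 04-25 to 04-28), while Karl most recently started failing at 2022-04-27 4:00:00 and has submitted 2 failure blocks (03-28, 04-27 to 04-28).

The final output should be: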

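| name   | status  | started recently failing timestamp | recent days failing | total days failing | total failure blocks |
|--------|---------|------------------------------------|---------------------|--------------------|----------------------|
| Andrea | Failure | 2022-04-25 4:00:00                 | 4                   | 8                  | 3                    |
| Karl   | Failure | 2022-04-27 4:00:00                 | 2                   | 3                  | 2                    |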

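Thanks in advance to anyone who can help; it is much appreciated.

Take a look at the query below, though it is not fully polished yet. Hopefully it gives you a lead toward solving your problem.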

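  1. failure_blocks computes the day gap between successive failures and uses it to assign each run of consecutive failing days to a block.
  2. last_blocks finds the last failure block in order to identify started_recently_failing_timestamp.
  3. The main query produces the expected output from the preceding CTEs.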
WITH failure_table AS (
  SELECT 'Andrea' AS name, 'Failure' AS status, TIMESTAMP '2022-04-28 4:00:00' AS dt
   UNION ALL SELECT 'Karl', 'Failure', '2022-04-28 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-27 4:00:00'
   UNION ALL SELECT 'Karl', 'Failure', '2022-04-27 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-26 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-04-25 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-30 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-29 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-28 4:00:00'
   UNION ALL SELECT 'Karl', 'Failure', '2022-03-28 4:00:00'
   UNION ALL SELECT 'Andrea', 'Failure', '2022-03-15 4:00:00'
),
failure_blocks AS (
  SELECT *,
         COUNTIF(diff <> 1) OVER (PARTITION BY name) AS total_failure_blocks,
         COUNT(*) OVER (PARTITION BY name) AS total_days_failing,
         SUM(diff - 1) OVER (PARTITION BY name ORDER BY dt) AS block,
    FROM (
      SELECT name, status, dt, IFNULL(DATE_DIFF(dt, LAG(dt) OVER (PARTITION BY name ORDER BY dt), DAY), 0) AS diff
        FROM failure_table
    )
),
last_blocks AS (
SELECT * EXCEPT(diff, block), 
       COUNT(*) OVER (PARTITION BY name, block) AS recent_days_failing,
       FIRST_VALUE(dt) OVER (PARTITION BY name, block ORDER BY dt) AS block_start_dt
  FROM failure_blocks
)
SELECT name, status, 
       MAX(block_start_dt) OVER (PARTITION BY name) AS started_recently_failing_timestamp,
       recent_days_failing,
       total_days_failing,
       total_failure_blocks,
  FROM last_blocks 
 WHERE TRUE QUALIFY dt = started_recently_failing_timestamp
;
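
To make the block logic in failure_blocks easier to follow, here is a minimal Python sketch of the same idea. It is not the BigQuery query itself: the helper name `summarize_failures` and the hard-coded sample dates are assumptions made for illustration, and the time of day is dropped. It computes the gap in days to the previous failure (the LAG / DATE_DIFF step), starts a new block whenever that gap is not 1, and then reports the start and length of the last block.

```python
from datetime import date

# Sample failure dates from the question (time of day dropped for simplicity).
FAILURES = {
    "Andrea": ["2022-04-28", "2022-04-27", "2022-04-26", "2022-04-25",
               "2022-03-30", "2022-03-29", "2022-03-28", "2022-03-15"],
    "Karl": ["2022-04-28", "2022-04-27", "2022-03-28"],
}


def summarize_failures(days):
    """Hypothetical helper; assumes a non-empty list of ISO date strings.

    Returns (last_block_start, recent_days, total_days, total_blocks).
    """
    ordered = sorted(date.fromisoformat(d) for d in days)
    blocks = []  # each block is a list of consecutive dates
    previous = None
    for current in ordered:
        # Mirrors IFNULL(DATE_DIFF(dt, LAG(dt), DAY), 0): the first row gets 0.
        gap = (current - previous).days if previous is not None else 0
        if gap != 1:
            blocks.append([])  # any gap other than 1 day starts a new block
        blocks[-1].append(current)
        previous = current
    last = blocks[-1]
    return last[0].isoformat(), len(last), len(ordered), len(blocks)


RESULTS = {name: summarize_failures(days) for name, days in FAILURES.items()}

for name, row in RESULTS.items():
    print(name, *row)
```

On the sample data this reproduces the figures the question asks for: block start 2022-04-25 with 4 recent days, 8 total days and 3 blocks for Andrea, and block start 2022-04-27 with 2, 3 and 2 for Karl.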

Also consider the approach below:

select name, status,
  sum(if(rank = 1, consecutive_days, 0)) as recent_days_failing,
  sum(consecutive_days) as total_days_failing,
  count(block_id) as total_failure_block_ids
from (
  select name, status, block_id, 
    date_diff(max(dt), min(dt), day) + 1 as consecutive_days,
    rank() over(partition by name, status order by block_id) rank
  from (
    select name, status, date(timestamp) dt,
      row_number() over(partition by name, status order by timestamp) + 
      date_diff(current_date, date(timestamp), day) as block_id
    from failure_table
  )
  group by name, status, block_id
)
group by name, status   
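
The block_id expression above relies on a small invariant: within a run of consecutive days the row number goes up by one while the distance to the current date goes down by one, so their sum stays constant. The Python sketch below illustrates that arithmetic only; the names `block_ids` and `summarize`, the fixed `REFERENCE_DAY` standing in for current_date, and the use of row counts per block (equivalent to date_diff(max, min) + 1 when there are no duplicate days) are assumptions for illustration.

```python
from collections import Counter
from datetime import date

REFERENCE_DAY = date(2022, 4, 29)  # assumption: stands in for current_date

ANDREA = ["2022-04-28", "2022-04-27", "2022-04-26", "2022-04-25",
          "2022-03-30", "2022-03-29", "2022-03-28", "2022-03-15"]
KARL = ["2022-04-28", "2022-04-27", "2022-03-28"]


def block_ids(days, today=REFERENCE_DAY):
    """row_number() over dates ascending + days between each date and today."""
    ordered = sorted(date.fromisoformat(d) for d in days)
    return [i + (today - d).days for i, d in enumerate(ordered, start=1)]


def summarize(days):
    """Returns (recent_days_failing, total_days_failing, total_failure_blocks)."""
    sizes = Counter(block_ids(days))  # block_id -> number of days in that block
    # Later blocks get smaller ids, so the minimum id is the most recent block
    # (the row with rank = 1 in the query).
    recent = sizes[min(sizes)]
    return recent, sum(sizes.values()), len(sizes)


print(block_ids(ANDREA))
print(summarize(ANDREA), summarize(KARL))
```

For Andrea the ids come out as 46 for 03-15, 34 for 03-28 to 03-30, and 9 for 04-25 to 04-28, which is why grouping by block_id separates the three blocks.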

If applied to the sample data in your question, the output is