Doing a running count on non-partitioned data in Google BigQuery

I have some data that looks like this:

Date  | Priority
----------------
01/01 | Low
02/01 | Low
03/01 | Low
04/01 | Med
05/01 | Med
06/01 | Low
07/01 | High
08/01 | High
09/01 | Med
...

I'd like to add a column showing the number of days it is in the current priority, so that it looks like this:

Date  | Priority | Days in state
--------------------------------
01/01 | Low      | 3
02/01 | Low      | 2
03/01 | Low      | 1
04/01 | Med      | 2
05/01 | Med      | 1
06/01 | Low      | 1
07/01 | High     | 2
08/01 | High     | 1
09/01 | Med      | 1
...

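I'm struggling to do this because I can't partition the data as it stands. Partitioning by priority would take every historical occurrence of that priority into account, not just the current "window".

I've used IF(LAG(priority) OVER(ORDER BY date) = priority,1,0) to flag when a change occurs, but I don't know where to go from there.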

This is a type of gaps-and-islands problem. For your purposes, the simplest approach is probably to subtract a sequence and use window functions:

select t.*,
       -- count down within each run of consecutive days with the same priority
       row_number() over (partition by priority, grp order by date desc) as days_to_next_state
from (select t.*,
             -- date minus sequence number is constant within a run
             date_add(date, interval -seqnum day) as grp
      from (select t.*,
                   row_number() over (partition by priority order by date) as seqnum
            from t
           ) t
     ) t
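
To see why subtracting the sequence number works, here is a minimal Python sketch that mirrors the steps of the query above on the sample data from the question. It is only an illustration of the logic, not BigQuery code: the function name days_to_next_state, the rows list, and the year 2019 are assumptions made for the example, and it presumes one row per calendar day.

```python
from datetime import date, timedelta

# Sample rows from the question; the year 2019 is an assumption.
rows = [
    (date(2019, 1, 1), "Low"), (date(2019, 1, 2), "Low"),
    (date(2019, 1, 3), "Low"), (date(2019, 1, 4), "Med"),
    (date(2019, 1, 5), "Med"), (date(2019, 1, 6), "Low"),
    (date(2019, 1, 7), "High"), (date(2019, 1, 8), "High"),
    (date(2019, 1, 9), "Med"),
]

def days_to_next_state(rows):
    """Mimic the subtract-a-sequence query on (date, priority) rows."""
    seq = {}      # row_number() over (partition by priority order by date)
    keyed = []
    for d, p in sorted(rows):
        seq[p] = seq.get(p, 0) + 1
        # Within a run of consecutive days, date and seqnum both grow by 1,
        # so their difference stays constant and identifies the run.
        grp = d - timedelta(days=seq[p])
        keyed.append((d, p, grp))

    counts = {}   # row_number() over (partition by priority, grp order by date desc)
    result = []
    for d, p, grp in sorted(keyed, reverse=True):
        counts[(p, grp)] = counts.get((p, grp), 0) + 1
        result.append((d, p, counts[(p, grp)]))
    return sorted(result)

for d, p, n in days_to_next_state(rows):
    print(d, p, n)
```

For the first three Low rows the difference is the same date (2018-12-31), while the later Low row on 2019-01-06 gets 2019-01-02, which is what separates the two Low runs.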

Below is for BigQuery Standard SQL:

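#standardSQL
SELECT date, Priority,
  -- days left in the current run, counting the current day
  1 + DATE_DIFF(MAX(date) OVER(PARTITION BY grp), date, DAY) AS Days_in_state
FROM (
  SELECT date, Priority,
    -- running count of change flags gives each run its own group number
    COUNTIF(start_new_Priority) OVER(ORDER BY date) AS grp
  FROM (
    SELECT date, Priority,
      -- TRUE on the first row and whenever Priority differs from the previous row
      IFNULL(Priority != LAG(Priority) OVER(ORDER BY date), TRUE) AS start_new_Priority
    FROM `project.dataset.table`
  )
)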

If applied to the sample data from your question, the result is:

Row date        Priority    Days_in_state    
1   2019-01-01  Low         3    
2   2019-01-02  Low         2    
3   2019-01-03  Low         1    
4   2019-01-04  Med         2    
5   2019-01-05  Med         1    
6   2019-01-06  Low         1    
7   2019-01-07  High        2    
8   2019-01-08  High        1    
9   2019-01-09  Med         1
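
As a cross-check of the flag-and-running-count idea, the following Python sketch reproduces the same three steps (change flag, running count of flags, distance to the last date in the group) outside BigQuery. The function name days_in_state and the rows list are hypothetical names chosen for this illustration, and the year 2019 is assumed to match the result shown above.

```python
from datetime import date

# Sample rows from the question; the year 2019 is an assumption.
rows = [
    (date(2019, 1, 1), "Low"), (date(2019, 1, 2), "Low"),
    (date(2019, 1, 3), "Low"), (date(2019, 1, 4), "Med"),
    (date(2019, 1, 5), "Med"), (date(2019, 1, 6), "Low"),
    (date(2019, 1, 7), "High"), (date(2019, 1, 8), "High"),
    (date(2019, 1, 9), "Med"),
]

def days_in_state(rows):
    """Mimic the LAG / COUNTIF / MAX-per-group steps on (date, priority) rows."""
    ordered = sorted(rows)
    grp = 0
    grouped = []
    for i, (d, p) in enumerate(ordered):
        # Change flag: true on the first row or when the priority changes.
        starts_new = i == 0 or p != ordered[i - 1][1]
        # Running count of flags: each run of equal priorities shares a number.
        if starts_new:
            grp += 1
        grouped.append((d, p, grp))

    # Latest date seen in each group, like MAX(date) OVER (PARTITION BY grp).
    last = {}
    for d, _, g in grouped:
        last[g] = max(last.get(g, d), d)

    return [(d, p, 1 + (last[g] - d).days) for d, p, g in grouped]

for d, p, n in days_in_state(rows):
    print(d, p, n)
```

Because this version measures a date difference rather than counting rows, it agrees with a row count only when every day has exactly one row, as in the sample data.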