Doing a running count on non-partitioned data in Google BigQuery
I have some data that looks like this:
Date | Priority
----------------
01/01 | Low
02/01 | Low
03/01 | Low
04/01 | Med
05/01 | Med
06/01 | Low
07/01 | High
08/01 | High
09/01 | Med
...
I'd like to add a column showing how many more days it stays at its current priority, so that it looks like this:
Date | Priority | Days in state
--------------------------------
01/01 | Low | 3
02/01 | Low | 2
03/01 | Low | 1
04/01 | Med | 2
05/01 | Med | 1
06/01 | Low | 1
07/01 | High | 2
08/01 | High | 1
09/01 | Med | 1
...
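I'm having trouble doing this because I can't partition the data as it stands. Partitioning by priority would take into account every historical occurrence of that priority, not just the current "window".
I've used IF(LAG(priority) OVER(ORDER BY date) = priority,1,0)
to flag when a change occurs, but I don't know where to go from there.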
This is a type of gaps-and-islands problem. For your purposes, the simplest method is probably to subtract a sequence and use window functions:
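select t.*,
       -- number the rows backwards within each island of consecutive days
       row_number() over (partition by priority, grp order by date desc) as days_to_next_state
from (select t.*,
             -- date minus the per-priority sequence is constant within a run of consecutive days
             date_sub(date, interval seqnum day) as grp
      from (select t.*,
                   row_number() over (partition by priority order by date) as seqnum
            from t
           ) t
     ) t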
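To see why subtracting the sequence works, here is a minimal Python sketch of the same idea. It is only an illustration, not BigQuery code: the function name `days_in_state_by_sequence` and the `rows` list are made up for this example, the year 2019 is taken from the sample output further down, and it assumes exactly one row per calendar day.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample data from the question (year 2019 assumed, one row per day).
rows = [
    (date(2019, 1, 1), "Low"), (date(2019, 1, 2), "Low"), (date(2019, 1, 3), "Low"),
    (date(2019, 1, 4), "Med"), (date(2019, 1, 5), "Med"), (date(2019, 1, 6), "Low"),
    (date(2019, 1, 7), "High"), (date(2019, 1, 8), "High"), (date(2019, 1, 9), "Med"),
]

def days_in_state_by_sequence(rows):
    """Hypothetical helper mirroring the subtract-a-sequence query."""
    rows = sorted(rows)
    seqnum = defaultdict(int)    # row_number() partitioned by priority
    islands = defaultdict(list)  # (priority, grp) -> dates in that island
    for d, priority in rows:
        seqnum[priority] += 1
        # Within a run of consecutive days, date and seqnum both grow by 1,
        # so their difference stays constant and identifies the island.
        grp = d - timedelta(days=seqnum[priority])
        islands[(priority, grp)].append(d)
    rank = {}
    for dates in islands.values():
        # row_number() ... order by date desc within each island
        for n, d in enumerate(sorted(dates, reverse=True), start=1):
            rank[d] = n
    return [(d, priority, rank[d]) for d, priority in rows]

for d, priority, n in days_in_state_by_sequence(rows):
    print(d, priority, n)
```

Note that this trick relies on the dates being consecutive: a missing day inside a run would split it into two islands.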
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, Priority,
  1 + DATE_DIFF(MAX(date) OVER(PARTITION BY grp), date, DAY) AS Days_in_state
FROM (
  SELECT date, Priority,
    COUNTIF(start_new_Priority) OVER(ORDER BY date) AS grp
  FROM (
    SELECT date, Priority,
      IFNULL(Priority != LAG(Priority) OVER(ORDER BY date), TRUE) AS start_new_Priority
    FROM `project.dataset.table`
  )
)
If applied to the sample data from your question, the result is
Row date Priority Days_in_state
1 2019-01-01 Low 3
2 2019-01-02 Low 2
3 2019-01-03 Low 1
4 2019-01-04 Med 2
5 2019-01-05 Med 1
6 2019-01-06 Low 1
7 2019-01-07 High 2
8 2019-01-08 High 1
9 2019-01-09 Med 1
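The three nested steps of that query (flag the start of each run, take a running count of the flags as a group id, then measure the distance to the last date in the group) can be traced outside BigQuery with a short Python sketch. This is an illustration only: `days_in_state` and `rows` are hypothetical names introduced here, and the dates reuse the 2019 sample above.

```python
from datetime import date
from itertools import accumulate

# Sample data from the question (same dates as the result table above).
rows = [
    (date(2019, 1, 1), "Low"), (date(2019, 1, 2), "Low"), (date(2019, 1, 3), "Low"),
    (date(2019, 1, 4), "Med"), (date(2019, 1, 5), "Med"), (date(2019, 1, 6), "Low"),
    (date(2019, 1, 7), "High"), (date(2019, 1, 8), "High"), (date(2019, 1, 9), "Med"),
]

def days_in_state(rows):
    """Hypothetical helper mirroring the flag / running-count / max-date query."""
    rows = sorted(rows)
    # Step 1: flag rows whose priority differs from the previous row
    # (the first row has no predecessor, so it always starts a group).
    starts = [i == 0 or rows[i][1] != rows[i - 1][1] for i in range(len(rows))]
    # Step 2: running count of the flags, like COUNTIF(...) OVER(ORDER BY date).
    grps = list(accumulate(int(s) for s in starts))
    # Step 3: last date per group, like MAX(date) OVER(PARTITION BY grp).
    last = {}
    for (d, _), g in zip(rows, grps):
        last[g] = max(last.get(g, d), d)
    # 1 + difference in days between the row's date and the group's last date.
    return [(d, p, 1 + (last[g] - d).days) for (d, p), g in zip(rows, grps)]

for d, p, n in days_in_state(rows):
    print(d, p, n)
```

Because the final step is a date difference rather than a row count, a missing day inside a run is still counted as a calendar day in the state.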