How to take aggregations for a repeating window in BigQuery
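I have a dataset in which each id consists of steps. Each step has a timestamp and a channel name. For a given id, a channel can repeat multiple times across several timestamps.
I'm trying to measure, for each block of a repeated channel (ordered by timestamp), how many times it occurred.
Here is my sample data -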
with temp as (
select 1 as id, '2019-08-02 13:13:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:14:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:15:27 UTC' as t_date, 'display' as channel union all
select 1 as id, '2019-08-02 13:16:27 UTC' as t_date, 'display' as channel union all
select 1 as id, '2019-08-02 13:17:27 UTC' as t_date, 'email' as channel union all
select 1 as id, '2019-08-02 13:18:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:11:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:12:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:13:27 UTC' as t_date, 'email' as channel union all
select 2 as id, '2019-08-02 13:14:27 UTC' as t_date, 'email' as channel
)
select id, channel , count(1) appearances
from temp
group by id , channel
order by id
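This gives me the output as
However, I need something like this -
As shown in the output, for each sequence of a channel that appears consecutively, I need to count the appearances along with the start and end time. For example, the first record in the output belongs to the email channel, which for id = 1 starts at 2019-08-02 13:13:27 UTC and ends at 2019-08-02 13:14:27 UTC, ordered by timestamp. The last column shows how many times the email channel repeated before switching to the next channel (display in this case).
How can I achieve this in BigQuery?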
Consider the approach below
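-- Step 3: aggregate each consecutive run (id, channel, group_id)
select id, channel,
min(t_date) as start_date,
max(t_date) as end_date,
count(1) as appearances
from (
-- Step 2: running count of run starts gives every row a group_id within its id
select *, countif(new_group) over (partition by id order by t_date) group_id
from (
-- Step 1: flag rows whose channel differs from the previous row (first row per id is always flagged)
select *, ifnull(channel != lag(channel) over win, true) new_group
from temp
window win as (partition by id order by t_date)
)
)
group by id, channel, group_id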
If applied to the sample data in your question, the output is
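To make the two steps of the query concrete (flag a row where the channel changes, then use a running count of those flags as a group id), here is a minimal Python sketch of the same logic on the sample rows. It is only an illustration for checking the grouping by hand, not BigQuery code; the function name `channel_runs` and the tuple layout are assumptions made for this example, and it assumes channel is never NULL.

```python
# Sample rows from the question: (id, t_date, channel)
ROWS = [
    (1, "2019-08-02 13:13:27 UTC", "email"),
    (1, "2019-08-02 13:14:27 UTC", "email"),
    (1, "2019-08-02 13:15:27 UTC", "display"),
    (1, "2019-08-02 13:16:27 UTC", "display"),
    (1, "2019-08-02 13:17:27 UTC", "email"),
    (1, "2019-08-02 13:18:27 UTC", "email"),
    (2, "2019-08-02 13:11:27 UTC", "email"),
    (2, "2019-08-02 13:12:27 UTC", "email"),
    (2, "2019-08-02 13:13:27 UTC", "email"),
    (2, "2019-08-02 13:14:27 UTC", "email"),
]


def channel_runs(rows):
    """Return (id, channel, start_date, end_date, appearances) per consecutive run.

    Hypothetical helper that mirrors the query: new_group flag, running
    group_id per id, then an aggregate per (id, channel, group_id).
    """
    runs = {}  # (id, channel, group_id) -> [start_date, end_date, appearances]
    prev_id, prev_channel, group_id = None, None, 0
    # Sorting by (id, t_date) plays the role of "partition by id order by t_date".
    for row_id, t_date, channel in sorted(rows, key=lambda r: (r[0], r[1])):
        if row_id != prev_id:
            # A new id is a new partition, so the state starts over.
            prev_channel, group_id = None, 0
        # Mirrors ifnull(channel != lag(channel) over win, true).
        new_group = prev_channel is None or channel != prev_channel
        # Mirrors countif(new_group) over (partition by id order by t_date).
        group_id += int(new_group)
        run = runs.setdefault((row_id, channel, group_id), [t_date, t_date, 0])
        run[0] = min(run[0], t_date)  # start_date
        run[1] = max(run[1], t_date)  # end_date
        run[2] += 1                   # appearances
        prev_id, prev_channel = row_id, channel
    return [(k[0], k[1], v[0], v[1], v[2]) for k, v in runs.items()]


if __name__ == "__main__":
    for run in channel_runs(ROWS):
        print(run)
```

On the sample rows this produces four runs: for id 1, email from 13:13:27 to 13:14:27, display from 13:15:27 to 13:16:27, and email again from 13:17:27 to 13:18:27, each with 2 appearances; for id 2, a single email run from 13:11:27 to 13:14:27 with 4 appearances. Note that t_date is a string in the sample, so min and max compare it as text; that works here only because every value uses the same fixed-width format.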