如何在 BigQuery 中进行数据分组?
How to Do Data-Grouping in BigQuery?
我有需要分组的数据库列表。我已经使用 R 成功地完成了这项工作,但现在我必须使用 BigQuery 来完成这项工作。数据显示如下table
| category | sub_category | date | day | timestamp | type | cpc | gmv |
|---------- |-------------- |----------- |----- |------------- |------ |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
我想对行进行分组。与下一行具有 <= 8 分钟时间戳差异 的行将被分组为一行,输出示例如下:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv |
|---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 |
| ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 |
| XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
有一些新生成的字段如下:
| fields | definition |
|----------------- |-------------------------------------------------------- |
| day | Day of the row (combination if there's different days) |
| time | Start of timestamp |
| start_timestamp | Start timestamp of the first row in group |
| end_timestamp | Start timestamp of the last row in group |
| type | Type of Row (combination if there's different types) |
| cpc | Average CPC of the Group |
| gwm | Average GMV of the Group |
谁能帮我按照上面的要求查询一下?
谢谢
这是一个缺口和孤岛问题。这是一个使用 lag()
和累积 sum()
来定义间隔小于 8 分钟的相邻记录组的解决方案;剩下的就是聚合了。
select
category,
sub_category,
string_agg(distinct day, '|' order by dt) day,
min(dt) start_dt,
max(dt) end_dt,
string_agg(distinct type, '|' order by dt) type,
avg(cpc) cpc,
avg(gwm) gwm
from (
select
t.*,
sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
over(partition by category, sub_category order by dt) grp
from (
select
t.*,
lag(dt) over(partition by category, sub_category order by dt) lag_dt
from (
select t.*, datetime(date, timestamp) dt
from mytable t
) t
) t
) t
) t
group by category, sub_category, grp
请注意,您不应将时间戳的日期和时间部分存储在单独的列中:当您需要组合它们时,这会使逻辑更加复杂(我添加了另一层嵌套以避免重复转换,这会混淆了代码)。
我有需要分组的数据库列表。我已经使用 R 成功地完成了这项工作,但现在我必须使用 BigQuery 来完成这项工作。数据显示如下table
| category | sub_category | date | day | timestamp | type | cpc | gmv | |---------- |-------------- |----------- |----- |------------- |------ |------ |--------- | | ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 | | ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 | | ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 | | ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 | | ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 | | ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 | | ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 | | XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
我想对行进行分组。与下一行具有 <= 8 分钟时间戳差异 的行将被分组为一行,输出示例如下:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv | |---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- | | ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 | | ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 | | XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
有一些新生成的字段如下:
| fields | definition | |----------------- |-------------------------------------------------------- | | day | Day of the row (combination if there's different days) | | time | Start of timestamp | | start_timestamp | Start timestamp of the first row in group | | end_timestamp | Start timestamp of the last row in group | | type | Type of Row (combination if there's different types) | | cpc | Average CPC of the Group | | gwm | Average GMV of the Group |
谁能帮我按照上面的要求查询一下?
谢谢
这是一个缺口和孤岛问题。这是一个使用 lag()
和累积 sum()
来定义间隔小于 8 分钟的相邻记录组的解决方案;剩下的就是聚合了。
select
category,
sub_category,
string_agg(distinct day, '|' order by dt) day,
min(dt) start_dt,
max(dt) end_dt,
string_agg(distinct type, '|' order by dt) type,
avg(cpc) cpc,
avg(gwm) gwm
from (
select
t.*,
sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
over(partition by category, sub_category order by dt) grp
from (
select
t.*,
lag(dt) over(partition by category, sub_category order by dt) lag_dt
from (
select t.*, datetime(date, timestamp) dt
from mytable t
) t
) t
) t
) t
group by category, sub_category, grp
请注意,您不应将时间戳的日期和时间部分存储在单独的列中:当您需要组合它们时,这会使逻辑更加复杂(我添加了另一层嵌套以避免重复转换,这会混淆了代码)。