Aggregating one BigQuery table into another BigQuery table

Question

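I'm trying to aggregate a multi-petabyte (about 7 PB) BigQuery table into another BigQuery table.

I have (partition_key, clusterkey1, clusterkey2, col1, col2, val),

where partition_key is used for BigQuery partitioning and the clusterkey columns are used for clustering.

For example: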

(timestamp1, timestamp2, 0, 1, 2, 1)

(timestamp3, timestamp4, 0, 1, 2, 7)

(timestamp31, timestamp22, 2, 1, 2, 2)

(timestamp11, timestamp12, 2, 1, 2, 3)

The result should be:

(0, 1, 2, 8)

(2, 1, 2, 5)

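I want to aggregate val by (clusterkey2, col1, col2), across all partition_key and all clusterkey1 values.

What is a feasible way to do this?

Should I write a custom loader and read all the data row by row, or is there a native way to do this?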

Answer 1

Depending on where and how you run it, you can write a simple SQL script and define a destination for the output, for example:

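-- `project.dataset.source_table` is a placeholder; substitute your own table.
SELECT clusterkey2
     , col1
     , col2
     , SUM(val) AS total_val  -- sum of val per (clusterkey2, col1, col2)
FROM `project.dataset.source_table`
GROUP BY clusterkey2, col1, col2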

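This will give you the result you want.

From here you can do a few things, but most of them are outlined in the documentation: https://cloud.google.com/bigquery/docs/writing-results#writing_query_results

Specifically, based on the above, you'll want to set a destination table.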
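One way to set the destination directly in SQL is a CREATE TABLE ... AS SELECT statement. This is a minimal sketch, not necessarily the approach the linked documentation leads with. The table names `my_project.my_dataset.source_table` and `my_project.my_dataset.aggregated_table` and the alias total_val are hypothetical placeholders.

```sql
-- Assumption: both table names are placeholders; replace them with your own.
-- CREATE OR REPLACE overwrites the destination table if it already exists.
CREATE OR REPLACE TABLE `my_project.my_dataset.aggregated_table` AS
SELECT
  clusterkey2,
  col1,
  col2,
  SUM(val) AS total_val  -- one output row per (clusterkey2, col1, col2)
FROM `my_project.my_dataset.source_table`
GROUP BY clusterkey2, col1, col2;
```

The aggregation runs entirely inside BigQuery, so no custom loader or row-by-row read is needed.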

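One thing to note: if you don't want the aggregate over the entire table, you may want to include the partition key in the WHERE clause to help narrow down the data.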
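As a sketch of that partition filter, the query below assumes partition_key is a TIMESTAMP column (the question's sample rows suggest this, but it is not confirmed). The date range and table name are made-up placeholders.

```sql
-- Assumptions: partition_key is a TIMESTAMP column; the table name and the
-- date range are hypothetical and only illustrate the filter.
SELECT
  clusterkey2,
  col1,
  col2,
  SUM(val) AS total_val
FROM `my_project.my_dataset.source_table`
WHERE partition_key >= TIMESTAMP '2023-01-01'
  AND partition_key <  TIMESTAMP '2023-02-01'  -- half-open range: January only
GROUP BY clusterkey2, col1, col2;
```

Filtering on the partitioning column with constant bounds lets BigQuery prune partitions outside the range, which reduces the amount of data scanned.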


google-bigquery