Is it possible to partition nested tables in BigQuery?

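I am currently migrating my data warehouse to BigQuery. I have been trying to denormalize the database because, from what I have read, that makes queries more efficient and cheaper. However, this has resulted in some nested tables. If each nested table has "created_at" and "last_modified_at" columns, is there any way to partition my tables using either of these values?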

No, you cannot partition a table by a field inside a nested table. According to the docs:

You can partition BigQuery tables by:

Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.

Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.

Integer range: Tables are partitioned based on an integer column.

In addition, the partitioning column must be a top-level field; it cannot be a leaf field of a RECORD (STRUCT):

Limitations

You cannot use legacy SQL to query partitioned tables or to write query results to partitioned tables.

Time-unit column-partitioned tables are subject to the following limitations:

The partitioning column must be either a scalar DATE, TIMESTAMP, or DATETIME column. While the mode of the column can be REQUIRED or NULLABLE, it cannot be REPEATED (array-based). The partitioning column must be a top-level field. You cannot use a leaf field from a RECORD (STRUCT) as the partitioning column.

Integer-range partitioned tables are subject to the following limitations:

The partitioning column must be an INTEGER column. While the mode of the column may be REQUIRED or NULLABLE, it cannot be REPEATED (array-based). The partitioning column must be a top-level field. You cannot use a leaf field from a RECORD (STRUCT) as the partitioning column.

Although clustering supports more data types than partitioning, you still cannot cluster a table by a RECORD (STRUCT) column:

Clustering columns must be top-level, non-repeated columns of one of the following types:

DATE, BOOL, GEOGRAPHY, INT64, NUMERIC, BIGNUMERIC, STRING, TIMESTAMP, DATETIME
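To make the top-level rule concrete, here is a minimal Python sketch that builds a CREATE TABLE statement which partitions and clusters on top-level columns only. The table name `mydataset.orders`, the schema, and the helper `build_partitioned_table_ddl` are all hypothetical and not from the question. It also assumes the partitioning column is a TIMESTAMP, hence the `DATE(...)` wrapper.

```python
def build_partitioned_table_ddl(table, partition_column, cluster_columns):
    """Return BigQuery DDL that partitions and clusters on top-level columns.

    The schema below is a made-up example: `items` is a nested, repeated
    RECORD whose own created_at cannot be used for partitioning.
    """
    # A dotted path such as "items.created_at" points inside a RECORD,
    # which is not allowed for partitioning or clustering columns.
    for name in [partition_column, *cluster_columns]:
        if "." in name:
            raise ValueError(f"{name!r} is not a top-level column")
    return (
        f"CREATE TABLE `{table}` (\n"
        "  order_id INT64,\n"
        "  created_at TIMESTAMP,\n"
        "  items ARRAY<STRUCT<sku STRING, created_at TIMESTAMP>>\n"
        ")\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = build_partitioned_table_ddl("mydataset.orders", "created_at", ["order_id"])
print(ddl)
```

Passing `"items.created_at"` as the partition column raises `ValueError` here. That check is only a local stand-in for the error BigQuery itself would return.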

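If you are partitioning to make date/time queries more efficient, and each nested table covers a similar time range, I suggest unnesting the nested table into the parent table. If you do not want to unnest it, it may help to add another column to the main table that holds the earliest or latest date from the nested table, and partition by that new column.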
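The second suggestion could look roughly like the sketch below, which builds a CREATE TABLE ... AS SELECT statement that copies the table and adds a top-level column holding the earliest nested timestamp. All names here (`mydataset.orders`, `items`, `earliest_item_created_at`, and the helper itself) are hypothetical. It assumes the nested field is a REPEATED RECORD with a TIMESTAMP `created_at`.

```python
def build_derived_partition_ctas(source, target, nested_field, ts_field, new_column):
    """Return a CTAS statement that partitions on a column derived from a nested field.

    Assumes `nested_field` is a REPEATED RECORD and `ts_field` is a TIMESTAMP
    inside it; use MAX instead of MIN to keep the latest value.
    """
    return (
        f"CREATE TABLE `{target}`\n"
        f"PARTITION BY DATE({new_column})\n"
        "AS\n"
        "SELECT\n"
        "  t.*,\n"
        # Correlated subquery: earliest timestamp among this row's nested records.
        f"  (SELECT MIN(n.{ts_field}) FROM UNNEST(t.{nested_field}) AS n) AS {new_column}\n"
        f"FROM `{source}` AS t"
    )


sql = build_derived_partition_ctas(
    source="mydataset.orders",
    target="mydataset.orders_partitioned",
    nested_field="items",
    ts_field="created_at",
    new_column="earliest_item_created_at",
)
print(sql)
```

The resulting string can be submitted with the google-cloud-bigquery client, for example `bigquery.Client().query(sql).result()`. Rows whose nested array is empty get a NULL in the new column. Queries only benefit from partition pruning when they filter on the new top-level column, not on the nested field.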