BigQuery Extract to GS with multiple wildcard URI gives EMPTY blobs

Question

I'm trying to extract a table from BigQuery in Python using the extract_table method of google.cloud.bigquery.Client, passing an array of multiple wildcard URIs as destination_uri.

destination_uri=['gs://{}/{}/{}-*'.format(bucket_name, prefix, i) for i in range(nb_node)]

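The expected behavior is that BigQuery splits my table evenly across multiple blobs.

The compressed size of the table is 242 MB.

What actually happens is that if I give 7 URIs, I get 1 file of 242 MB and 6 empty files of 20 B each.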
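A side note on the 20 B figure: it is consistent with a GZIP archive that contains no data at all, since the gzip framing alone takes 20 bytes. A minimal check using only the Python standard library (this does not touch BigQuery; it only illustrates the size of an empty gzip stream):

```python
import gzip

# Compress zero bytes of payload; only the gzip framing remains.
empty_archive = gzip.compress(b"")

# 10-byte header + 2-byte empty deflate block + 8-byte trailer (CRC32 and length).
print(len(empty_archive))  # prints 20

# Decompressing yields no data at all.
print(gzip.decompress(empty_archive))  # prints b''
```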

The other configuration parameters are destination_format = "NEWLINE_DELIMITED_JSON" and compression = "GZIP".
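For reference, the call described above might look roughly like the sketch below. It assumes the google-cloud-bigquery client library and default credentials; the helper names, table_id, and the "my-bucket" / "exports" values are placeholders for illustration and are not taken from the original code.

```python
def build_destination_uris(bucket_name, prefix, nb_node):
    """Build one wildcard URI per node, as in the question."""
    return ['gs://{}/{}/{}-*'.format(bucket_name, prefix, i) for i in range(nb_node)]


def extract_to_gcs(table_id, destination_uris):
    """Start the extract job and block until it finishes (hypothetical helper)."""
    # Imported here so the URI helper above works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Same settings as in the question: gzip-compressed newline-delimited JSON.
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    job_config.compression = bigquery.Compression.GZIP

    job = client.extract_table(table_id, destination_uris, job_config=job_config)
    return job.result()


# Placeholder bucket and prefix, only to show the shape of the URIs.
uris = build_destination_uris("my-bucket", "exports", 3)
print(uris)
```

Only build_destination_uris runs here; extract_to_gcs needs a real project, table, and bucket.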

Any idea why this happens?

Thanks a lot.

Answer 1

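There is no notion of distributing the data "evenly".

Exported file sizes can be erratic: one file may be several gigabytes while others are only a few megabytes.

This is described in the BigQuery documentation, and it matches our experience: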

If you are exporting more than 1 GB of data, you must export your data to multiple files. When you export your data to multiple files, the size of the files will vary.
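To see how uneven an export turned out, one option is to list the resulting blobs and compare their sizes. The sketch below assumes the google-cloud-storage client library; the helper names, the 20-byte threshold (taken from the sizes reported in the question), and the example blob names are illustrative assumptions.

```python
def list_shard_sizes(bucket_name, prefix):
    """Return {blob name: size in bytes} for the exported files (hypothetical helper)."""
    # Imported here so the pure-Python helper below works without the client library.
    from google.cloud import storage

    client = storage.Client()
    return {blob.name: blob.size for blob in client.list_blobs(bucket_name, prefix=prefix)}


def split_empty_shards(sizes, empty_size=20):
    """Separate shards that are no larger than an empty gzip archive from the rest."""
    empty = sorted(name for name, size in sizes.items() if size <= empty_size)
    non_empty = sorted(name for name, size in sizes.items() if size > empty_size)
    return empty, non_empty


# Made-up sizes mirroring the question: one large file, the rest 20-byte files.
example_sizes = {
    "exports/0-000000000000": 242 * 1024 * 1024,
    "exports/1-000000000000": 20,
    "exports/2-000000000000": 20,
}
empty, non_empty = split_empty_shards(example_sizes)
print(empty)
print(non_empty)
```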

Answer 2

Quite simply, if you want the extracted data to be sharded evenly, use a partitioned table in BQ.
