Best way to read minibatches from a very large BigQuery table?

I have a large (>200M rows) BigQuery table that I'd like to read mini-batches from so I can train a machine learning model. The dataset is too large to fit in memory, so I can't read it all at once, but I want the model to learn from all of the data. I'd also like to avoid issuing too many queries, since the network latency would slow down training. What's the best way to do this in Python?

Are you using Tensorflow?

tfio.bigquery.BigQueryClient in tensorflow-io 0.9.0 solves this:

read_session(
    parent,
    project_id,
    table_id,
    dataset_id,
    selected_fields,
    output_types=None,
    row_restriction='',
    requested_streams=1
)

requested_streams: Initial number of streams. If unset or 0, we will provide a value of streams so as to produce reasonable throughput. Must be non-negative. The number of streams may be lower than the requested number, depending on the amount of parallelism that is reasonable for the table and the maximum amount of parallelism allowed by the system.
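
To make this concrete, here is a rough usage sketch of wiring the read session into a tf.data pipeline (the project, dataset, table, and field names below are placeholders, and output_types must match your own schema):

import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient

client = BigQueryClient()
read_session = client.read_session(
    "projects/my-project",      # parent
    "my-project",               # project_id
    "my_table",                 # table_id
    "my_dataset",               # dataset_id
    selected_fields=["feature_a", "feature_b", "label"],
    output_types=[tf.float64, tf.float64, tf.int64],
    requested_streams=2,
)

# parallel_read_rows() reads all streams into a single tf.data.Dataset whose
# elements are dicts keyed by field name; shuffle within a buffer and batch
# into mini-batches for training.
dataset = (
    read_session.parallel_read_rows()
    .shuffle(buffer_size=10_000)
    .batch(1024)
)

Since each element is a dict of column tensors, you'll typically add a .map() step to split features from the label before handing the dataset to model.fit.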

Source code:

Felipe's answer works if you're using TF, but if you're on PyTorch or want something agnostic to your training platform, faucetml may be a good fit:

Based on the example in the docs, if you're training for two epochs:

from faucetml.data_reader import get_client  # client factory, per the faucetml README

fml = get_client(
    datastore="bigquery",
    credential_path="bq_creds.json",
    table_name="my_training_table",
    ds="2020-01-20",
    epochs=2,
    batch_size=1024,
    chunk_size=1024 * 10000,
    test_split_percent=20,
)
for epoch in range(2):
    fml.prep_for_epoch()
    batch = fml.get_batch()
    while batch is not None:
        train(batch)
        batch = fml.get_batch()
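
Note the split between batch_size and chunk_size: as I understand the faucetml design, each round trip to BigQuery fetches an entire chunk (here 1024 * 10000 rows) and caches it locally, and get_batch() then serves 1024-row mini-batches out of that cache, so the number of queries stays small and network latency doesn't dominate training time.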