How can I export data from BigQuery to an external server in a CSV?

Question

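I need to automate a process that extracts data from Google BigQuery and exports it to a CSV file on an external server outside GCP.

I have been researching how to do that, and I found some commands to run from my external server. But I would prefer to do everything within GCP to avoid possible problems.

To run the query to a CSV in Google Cloud Storage: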

bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv

To download the CSV from Google Cloud Storage:

gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]

But I would like to hear your suggestions.

Answer 1

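If you want to fully automate this process, I would do the following:

Create a Cloud Function to handle the export:

This is the more lightweight solution, as Cloud Functions are serverless and give you the flexibility to implement the code with the Client Libraries. See the quickstart; I recommend creating the function with the Console to start with.

In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function URL is called, it runs the code inside it.

Example Cloud Function code in Python that creates an export when an HTTP request is made: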

main.py

from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)

    # Build the table reference directly; Client.dataset() is deprecated.
    dataset = bigquery.DatasetReference(project_name, dataset_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )  
    return "Job with ID {} started exporting data from {}.{} to {}".format(extract_job.job_id, dataset_name, table_name, destination_uri)

requirements.txt

google-cloud-bigquery

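Note that the job runs asynchronously in the background. The response you receive contains the job ID, which you can use to check the status of the export job in Cloud Shell by running: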

bq show -j <job_id>
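
The same check can also be done from Python rather than Cloud Shell. This is only a minimal sketch, assuming a google.cloud.bigquery.Client is passed in as `client`; the helper names `summarize_job` and `check_export_job` are hypothetical and not part of the library.

```python
def summarize_job(state, error_result):
    """Turn a job state and optional error dict into a one-line summary."""
    if state != "DONE":
        return "still running (state: {})".format(state)
    if error_result:
        return "failed: {}".format(error_result.get("message", "unknown error"))
    return "finished successfully"


def check_export_job(client, job_id, location="US"):
    """Look up an extract job by ID and describe its current status.

    `client` is assumed to be a google.cloud.bigquery.Client; the location
    must match the one used when the job was started.
    """
    job = client.get_job(job_id, location=location)
    return summarize_job(job.state, job.error_result)
```

For example, calling `check_export_job(bigquery.Client(project="MY_PROJECT"), "<job_id>")` would return one of the three summaries above.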

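Create a Cloud Scheduler scheduled job:

Follow this documentation to get started. You can set the Frequency with the standard cron format; for example, 0 0 * * * runs the job every day at midnight.

As the Target, choose HTTP; as the URL, enter the Cloud Function HTTP URL (you can find it in the Console, in the Cloud Function details under the Trigger tab); and as the HTTP method, choose GET.

Create it, and you can test it in Cloud Scheduler by pressing the Run now button in the Console.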

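Synchronize your external server and the bucket:

So far you have only scheduled the export to run every 24 hours. To synchronize the bucket contents with your local machine, you can use the gsutil rsync command. If you want to save the exported files in, say, the my_exports folder, you can run the following on your external server: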

gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports

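To run this command periodically on your server, you can create a standard cron job in the crontab on your external server. Run it daily as well, a few hours after the BigQuery export, so that the export has finished by then.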
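
If you would rather have cron call a small script than the raw command, something like the following could work. It is a rough sketch that assumes gsutil is installed and authenticated on the external server; the function names, the script path and the 03:00 schedule are hypothetical.

```python
import subprocess


def build_rsync_command(bucket_name, local_dir):
    """Build the gsutil rsync command that mirrors a bucket into local_dir."""
    return ["gsutil", "rsync", "gs://{}".format(bucket_name), local_dir]


def sync_exports(bucket_name, local_dir):
    """Run the sync and raise CalledProcessError if gsutil exits non-zero."""
    subprocess.run(build_rsync_command(bucket_name, local_dir), check=True)


# Hypothetical crontab entry for a script that calls
# sync_exports("BUCKET_WITH_EXPORTS", "/local-path-to/my_exports")
# daily at 03:00, three hours after a midnight export:
#   0 3 * * * /usr/bin/python3 /path/to/sync_exports.py
```

Passing check=True makes a failed sync show up as a non-zero exit in the cron logs instead of passing silently.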

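Extra:

I have hard-coded most of the variables in the Cloud Function so that they are always the same. However, you can send parameters to the function if you make a POST request instead of a GET request and send the parameters as data in the body.

You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL; in the same place you can set the body to send parameters such as the table, dataset and bucket. This lets you run exports from different tables to different buckets at different times.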
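
Reading those parameters inside the function might look roughly like this. It is a sketch under assumptions: the Scheduler job is assumed to send a JSON body with a Content-Type: application/json header, and the key names (bucket, dataset, table), the helper names and the file naming pattern are all hypothetical.

```python
DEFAULTS = {
    "bucket": "MY_BUCKET",
    "dataset": "MY_DATASET",
    "table": "MY_TABLE",
}


def parse_export_params(payload, defaults=DEFAULTS):
    """Merge a JSON request body with defaults; unknown keys are ignored."""
    payload = payload or {}
    return {key: payload.get(key, default) for key, default in defaults.items()}


def build_destination_uri(params):
    """Build the Cloud Storage URI for the compressed CSV export."""
    return "gs://{}/{}_{}.csv.gz".format(
        params["bucket"], params["dataset"], params["table"]
    )
```

Inside the Cloud Function you would call `parse_export_params(request.get_json(silent=True))`, which falls back to the hard-coded defaults when the request has no JSON body, so the existing GET trigger keeps working.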

google-cloud-storage

google-bigquery

google-cloud-platform

google-cloud-dataflow