Can I write a custom query in the Google BigQuery Connector for AWS Glue?

Question

I'm creating a Glue ETL job that transfers data from BigQuery to S3. It is similar to this example, but uses my own dataset.
N.B.: I'm using the BigQuery Connector for AWS Glue v0.22.0-2 (link).

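The data in BigQuery is already partitioned by date. I want each Glue job run to fetch only one specific date (WHERE date = ...) and write the result to a single CSV file, but I can't find any indication of where to insert a custom WHERE clause.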

In the BigQuery source node configuration, these are the only options available:

Likewise, the generated script uses create_dynamic_frame.from_options, which does not accept a custom query (according to the documentation).

# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": args["BQ_PROJECT"],
            "table": args["BQ_TABLE"],
            "connectionName": args["BQ_CONNECTION_NAME"],
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)

So, is there any way to write a custom query? Or is there an alternative approach?

Answer 1

According to this AWS sample project, we can use filter in the connection options:

filter – Passes the condition to select the rows to convert. If the table is partitioned, the selection is pushed down and only the rows in the specified partition are transferred to AWS Glue. In all other cases, all data is scanned and the filter is applied in AWS Glue Spark processing, but it still helps limit the amount of memory used in total.

Example usage in the script:

# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": "...",
            "table": "...",
            "connectionName": "...",
            "filter": "date = 'yyyy-mm-dd'",  # put the condition here
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)
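
To fetch a different date on each run, the filter string can be built from a job argument instead of being hard-coded. The following is only a rough sketch under a few assumptions: the helper names build_connection_options and export_day_to_csv are made up for illustration, the partition column is called date as in the question, and writing a single CSV through coalesce(1) is plain Spark behaviour rather than something the connector provides.

```python
from datetime import date


def build_connection_options(project, table, connection_name, run_date):
    """Return connection options that restrict the read to one date.

    Hypothetical helper, not part of the connector or of AWS Glue.
    """
    # date.fromisoformat raises ValueError for strings that are not valid
    # ISO dates, which also keeps arbitrary text out of the filter expression.
    day = date.fromisoformat(run_date)
    return {
        "parentProject": project,
        "table": table,
        "connectionName": connection_name,
        # Assumption: the partition column is named "date", as in the question.
        "filter": f"date = '{day.isoformat()}'",
    }


def export_day_to_csv(glue_context, options, output_path):
    """Read the filtered rows and write them as a single CSV part file.

    Hypothetical helper; glue_context is the job's existing GlueContext.
    """
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options=options,
    )
    # coalesce(1) collapses the data to one Spark partition, so the writer
    # emits one part file under output_path.
    (
        frame.toDF()
        .coalesce(1)
        .write.mode("overwrite")
        .option("header", "true")
        .csv(output_path)
    )
```

Inside the job, the date could come from a job parameter, for example a hypothetical --RUN_DATE argument read with getResolvedOptions(sys.argv, ["RUN_DATE"]) from awsglue.utils and passed as run_date. Because coalesce(1) funnels all rows through one task, this approach is only reasonable when a single day's data is small enough for one executor.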
