How to specify insertId when streaming insert to BigQuery using Apache Beam

BigQuery supports de-duplication for streaming inserts. How can I use this feature with Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.

I could not find such a feature in the Java SDK documentation. https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html

In this question, they suggest setting the insertId in the TableRow. Is this correct?

https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableRow.html?is-external=true
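
For reference, in the low-level BigQuery v2 API model (the same model classes Beam uses), the insertId is not a field inside the TableRow payload itself; it sits on the wrapping TableDataInsertAllRequest.Rows entry next to the row. A minimal sketch, where the column values and the de-duplication key are made up for illustration:

    import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
    import com.google.api.services.bigquery.model.TableRow;
    import java.util.Collections;

    public class InsertIdSketch {
      public static void main(String[] args) {
        // The row payload itself: only column values, no insertId here.
        TableRow row = new TableRow().set("user_id", 42).set("event", "click");

        // insertId belongs to the wrapping Rows entry, alongside the payload.
        TableDataInsertAllRequest.Rows entry = new TableDataInsertAllRequest.Rows()
            .setInsertId("event-42-click")  // hypothetical de-duplication key
            .setJson(row);

        TableDataInsertAllRequest request = new TableDataInsertAllRequest()
            .setRows(Collections.singletonList(entry));
        // The request would then be sent via bigquery.tabledata().insertAll(...).
      }
    }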

The BigQuery client library has this feature:

https://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/bigquery/package-summary.html https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/InsertAllRequest.java#L134
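
For comparison, a minimal sketch with the google-cloud-bigquery client, where the row ID passed to addRow becomes the insertId; the dataset, table, and field names here are made up:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Collections;
    import java.util.Map;

    public class ClientLibraryInsertId {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");  // made-up table

        Map<String, Object> content = Collections.singletonMap("event", "click");

        // The first argument of addRow is used as the insertId for this row.
        InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
            .addRow("event-42-click", content)  // hypothetical de-duplication key
            .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
          System.err.println("Insert errors: " + response.getInsertErrors());
        }
      }
    }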

  • Pub/Sub + Beam/Dataflow + BigQuery: "exactly once" should be guaranteed, and you don't need to worry much about this. The guarantee is stronger when you ask Dataflow to insert into BigQuery using FILE_LOADS instead of STREAMING_INSERTS.

  • Kafka + Beam/Dataflow + BigQuery: if a message can be emitted from Kafka more than once (for example, if the producer retries the insert), then you need to handle de-duplication yourself, either in BigQuery (as currently implemented, per your comment) or in Dataflow with a .apply(Distinct.create()) transform, as in the sketch after this list.
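
A minimal sketch of the Kafka scenario combining both suggestions (Distinct for de-duplication, FILE_LOADS for the write). The broker, topic, table name, window sizes, and row conversion are all placeholders, not a definitive implementation:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    public class DedupPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("kafka:9092")   // placeholder broker
                .withTopic("events")                  // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            // Distinct needs bounded windows on an unbounded source, so
            // duplicates are only removed within each one-minute window.
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply(Distinct.create())
            .apply(MapElements
                .into(TypeDescriptor.of(TableRow.class))
                .via(kv -> new TableRow().set("payload", kv.getValue())))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder, assumed to exist
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                // FILE_LOADS gives a stronger exactly-once guarantee than
                // STREAMING_INSERTS; on unbounded input it requires a
                // triggering frequency and a shard count.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withTriggeringFrequency(Duration.standardMinutes(5))
                .withNumFileShards(1));

        p.run();
      }
    }

Note the trade-off: FILE_LOADS batches rows into load jobs, so it adds latency compared to streaming inserts in exchange for the stronger guarantee.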

As Felipe mentioned in the comments, Dataflow already uses insertId internally to implement "exactly once", so we cannot specify the insertId manually.