来自 Apache Beam 的 BigQuery 授权视图

BigQuery Authorized Views from Apache Beam

我正在尝试使用 Apache Beam 在 BigQuery 中查询视图。

视图可以访问它引用的所有数据集。 Dataflow/GCE 服务帐户可以访问视图,但 不能访问其基础数据集(这应该不是问题)。

当我尝试 运行 查询授权视图的作业时,出现如下错误:

java.lang.RuntimeException: java.io.IOException: Unable to get table: test_13249, aborting after 9 retries.
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:1004)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:491)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:477)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:471)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQueryHelper.executeQuery(BigQueryQueryHelper.java:109)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySourceDef.getTableReference(BigQueryQuerySourceDef.java:113)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:65)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:110)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:290)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:212)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:196)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:175)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table my-gcp-project:bigquery_dataset_any.test_13249: User does not have bigquery.tables.get permission for table my-gcp-project:bigquery_dataset_any.test_13249.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table my-gcp-project:bigquery_dataset_any.test_13249: User does not have bigquery.tables.get permission for table my-gcp-project:bigquery_dataset_any.test_13249.",
  "status" : "PERMISSION_DENIED"
}

Beam 在处理 BigQuery 时试图变得聪明。发生此错误是因为 Beam 检查查询引用的表,而授权视图并不总是可能发生这种情况。

要解决此问题,您可以使用 BigQueryIO.readTableRows()BigQueryIO.read(SerializableFunction) 中的方法 withQueryLocation。这将允许 Beam 使用提供的查询位置,而不是推断位置。

因此:

BigQueryIO.readTableRows()
    .fromQuery("SELECT * FROM my_authorized_view")
    .withQueryLocation("US")  // Whatever location is convenient for you
    ...

这应该可以解决您的问题。