bigquery DataFlow Error: Cannot read and write in different locations while reading and writing in EU
I have a simple Google Dataflow job. It reads from one BigQuery table and writes to another, like this:
(p
 | beam.io.Read(beam.io.BigQuerySource(
     query='select dia, import from DS1.t_27k where true',
     use_standard_sql=True))
 | beam.io.Write(beam.io.BigQuerySink(
     output_table,
     dataset='DS1',
     project=project,
     schema='dia:DATE, import:FLOAT',
     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
     write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
I think the problem is that this pipeline needs a temporary dataset to do its work, and I have no way to force the location of that temporary dataset. Since my DS1 dataset is in the EU (EUROPE-WEST1) while the temporary dataset is created in the US (I guess), the job fails:
WARNING:root:Dataset m-h-0000:temp_dataset_e433a0ef19e64100000000000001a does not exist so we will create it as temporary with location=None
WARNING:root:A task failed with exception.
HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/m-h-000000/queries/b8b2f00000000000000002bed336369d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Sat, 14 Oct 2017 20:29:15 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Sat, 14 Oct 2017 20:29:15 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="39,38,37,35"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
],
"code": 400,
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
}
Pipeline options:
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'm-h'
google_cloud_options.job_name = 'myjob3'
google_cloud_options.staging_location = r'gs://p_df/staging' #EUROPE-WEST1
google_cloud_options.region=r'europe-west1'
google_cloud_options.temp_location = r'gs://p_df/temp' #EUROPE-WEST1
options.view_as(StandardOptions).runner = 'DirectRunner' #'DataflowRunner'
p = beam.Pipeline(options=options)
How can I avoid this error?
Note: the error only appears when I run the job with DirectRunner.
The error Cannot read and write in different locations is fairly self-explanatory. It can happen because:
- your BigQuery dataset is in the EU while you run Dataflow in the US, or
- your GCS bucket is in the EU while you run Dataflow in the US.
As you specified in the question, your temp location in GCS is in the EU and your BigQuery dataset is also in the EU, so you have to run the Dataflow job in the EU as well.
To achieve this, specify the zone parameter in PipelineOptions, like this:
options = PipelineOptions()
wo = options.view_as(WorkerOptions) # type: WorkerOptions
wo.zone = "europe-west1-b"
# rest of your options:
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'm-h'
google_cloud_options.job_name = 'myjob3'
google_cloud_options.staging_location = r'gs://p_df/staging' # EUROPE-WEST1
google_cloud_options.region = r'europe-west1'
google_cloud_options.temp_location = r'gs://p_df/temp' # EUROPE-WEST1
options.view_as(StandardOptions).runner = 'DataflowRunner'
p = beam.Pipeline(options=options)
The BigQuerySource transform used by the Python DirectRunner does not automatically determine the location of the temporary tables. See BEAM-1909 for the issue.
This should work when you use the DataflowRunner.
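To double-check where each resource actually lives before launching the job, the dataset and bucket locations can be inspected with the bq and gsutil CLIs (a sketch; `m-h:DS1` and `gs://p_df` are the names from the question, and both commands require authenticated access to the project):

```shell
# Print the dataset metadata; the "location" field should read "EU"
# (or a specific EU region) for this pipeline to work.
bq show --format=prettyjson m-h:DS1

# Print the bucket metadata; look for the "Location constraint" line.
gsutil ls -L -b gs://p_df
```

If either location disagrees with the region you run Dataflow in, you will hit the same "Cannot read and write in different locations" error.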