当我 运行 数据流上的自定义模板时出现错误 'Unable to parse file'
Error 'Unable to parse file' when I run a custom template on dataflow
我正在尝试编写自定义模板来读取 CSV 并将其输出到另一个 CSV。 objective 是 select 此 CSV 中所需的数据。当我在 Web 界面上 运行 它时,出现以下错误
我已尽可能减少代码以了解我的错误,但我仍然没有看到它。
我帮助自己找到了文档:https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#creating-and-staging-templates
class UploadOptions(PipelineOptions):
@classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument(
'--input',
default='gs://[MYBUCKET]/input.csv',
help='Path of the file to read from')
parser.add_value_provider_argument(
'--output',
required=True,
help='Output file to write results to.')
pipeline_options = PipelineOptions(['--output', 'gs://[MYBUCKET]/output'])
p = beam.Pipeline(options=pipeline_options)
upload_options = pipeline_options.view_as(UploadOptions)
(p
| 'read' >> beam.io.Read(upload_options.input)
| 'Write' >> beam.io.WriteToText(upload_options.output, file_name_suffix='.csv'))
当前错误如下
无法解析文件 'gs://MYBUCKET/template.py'。
在终端中出现以下错误
错误: (gcloud.dataflow.jobs.run) FAILED_PRECONDITION: 无法解析模板文件 'gs://[MYBUCKET]/template.py'。
- '@type': type.googleapis.com/google.rpc.PreconditionFailure
违规行为:
- 描述:"Unexpected end of stream : expected '{'"
主题:0:0
输入:JSON
提前致谢
我设法解决了我的问题。问题来自我在管道读取中使用的变量。 Read 中必须使用 custom_options 变量而不是 known_args 变量
custom_options = pipeline_options.view_as(CustomPipelineOptions)
我制作了一个通用代码,如果有人需要,我会分享我的解决方案。
from __future__ import absolute_import
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, SetupOptions
class CustomPipelineOptions(PipelineOptions):
"""
Runtime Parameters given during template execution
path and organization parameters are necessary for execution of pipeline
campaign is optional for committing to bigquery
"""
@classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument(
'--path',
type=str,
help='Path of the file to read from')
parser.add_value_provider_argument(
'--output',
type=str,
help='Output file if needed')
def run(argv=None):
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
global cloud_options
global custom_options
pipeline_options = PipelineOptions(pipeline_args)
cloud_options = pipeline_options.view_as(GoogleCloudOptions)
custom_options = pipeline_options.view_as(CustomPipelineOptions)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
init_data = (p
| 'Hello World' >> beam.Create(['Hello World'])
| 'Read Input path' >> beam.Read(custom_options.path)
)
result = p.run()
# result.wait_until_finish
if __name__ == '__main__':
run()
然后启动以下命令在 GCP 上生成模板
python template.py --runner DataflowRunner --project $PROJECT --staging_location gs://$BUCKET/staging --temp_location gs://$BUCKET/temp --
template_location gs://$BUCKET/templates/$TemplateName
我正在尝试编写自定义模板来读取 CSV 并将其输出到另一个 CSV。 objective 是 select 此 CSV 中所需的数据。当我在 Web 界面上 运行 它时,出现以下错误
我已尽可能减少代码以了解我的错误,但我仍然没有看到它。 我帮助自己找到了文档:https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#creating-and-staging-templates
class UploadOptions(PipelineOptions):
@classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument(
'--input',
default='gs://[MYBUCKET]/input.csv',
help='Path of the file to read from')
parser.add_value_provider_argument(
'--output',
required=True,
help='Output file to write results to.')
pipeline_options = PipelineOptions(['--output', 'gs://[MYBUCKET]/output'])
p = beam.Pipeline(options=pipeline_options)
upload_options = pipeline_options.view_as(UploadOptions)
(p
| 'read' >> beam.io.Read(upload_options.input)
| 'Write' >> beam.io.WriteToText(upload_options.output, file_name_suffix='.csv'))
当前错误如下
无法解析文件 'gs://MYBUCKET/template.py'。
在终端中出现以下错误
错误: (gcloud.dataflow.jobs.run) FAILED_PRECONDITION: 无法解析模板文件 'gs://[MYBUCKET]/template.py'。 - '@type': type.googleapis.com/google.rpc.PreconditionFailure 违规行为: - 描述:"Unexpected end of stream : expected '{'" 主题:0:0 输入:JSON
提前致谢
我设法解决了我的问题。问题来自我在管道读取中使用的变量。 Read 中必须使用 custom_options 变量而不是 known_args 变量
custom_options = pipeline_options.view_as(CustomPipelineOptions)
我制作了一个通用代码,如果有人需要,我会分享我的解决方案。
from __future__ import absolute_import
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, SetupOptions
class CustomPipelineOptions(PipelineOptions):
"""
Runtime Parameters given during template execution
path and organization parameters are necessary for execution of pipeline
campaign is optional for committing to bigquery
"""
@classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument(
'--path',
type=str,
help='Path of the file to read from')
parser.add_value_provider_argument(
'--output',
type=str,
help='Output file if needed')
def run(argv=None):
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
global cloud_options
global custom_options
pipeline_options = PipelineOptions(pipeline_args)
cloud_options = pipeline_options.view_as(GoogleCloudOptions)
custom_options = pipeline_options.view_as(CustomPipelineOptions)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
init_data = (p
| 'Hello World' >> beam.Create(['Hello World'])
| 'Read Input path' >> beam.Read(custom_options.path)
)
result = p.run()
# result.wait_until_finish
if __name__ == '__main__':
run()
然后启动以下命令在 GCP 上生成模板
python template.py --runner DataflowRunner --project $PROJECT --staging_location gs://$BUCKET/staging --temp_location gs://$BUCKET/temp --
template_location gs://$BUCKET/templates/$TemplateName