Private IPs to run Google Dataflow (Apache Beam jobs)

We use the Python SDK for Apache Beam in a Google Dataflow environment. The tool is great, but we are concerned about the privacy of these jobs, because it looks like public IPs are used for the workers that run them. Our question is:

Our job template looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
    WorkerOptions,
)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER

### Note that we specified worker_options.subnetwork with our own subnetwork. However, once we run the job, it still looks like it creates workers with public IPs.


### In the end, the code runs like this

p = beam.Pipeline(options=options)

...

run = p.run()
run.wait_until_finish()

Thanks!

You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
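
As a minimal sketch (reusing the NETWORK placeholder from your template), the flag can either be passed alongside your existing flags or set programmatically; in the Beam Python SDK, --no_use_public_ips maps to use_public_ips=False on WorkerOptions:

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

# Option 1: pass the flag together with the flags you already use.
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Option 2: set the attribute the flag maps to.
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False
worker_options.subnetwork = NETWORK  # placeholder from the question

Note that once the workers have no public IPs, the subnetwork must still let them reach Google APIs (via Private Google Access); the linked page covers this.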