Private IPs to run Google Dataflow (Apache Beam jobs)
We use the Python SDK for Apache Beam in a Google Dataflow environment. The tool is great, but we are concerned about the privacy of these jobs, since it looks like public IPs are used to run the workers. Our questions are:
- Even if we specify a network and subnetwork, do we still have to worry about public IPs being used?
- What exactly is the difference, in terms of performance and security, when public IPs are restricted?
- How do we set up Dataflow so that all workers are created on private IPs? In theory, in the template below, we configured the pipeline so that this behavior should not be allowed (but it still is), per the docs.
Our job template looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER
Note that we set worker_options.subnetwork to our own subnetwork. However, once we run the job, it still looks like the workers are created on public IPs.
In the end, the code runs like this:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
Thanks!
You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
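For reference, a minimal sketch of how that flag could be wired into the template above (PROJECT, NETWORK, etc. are the same placeholders as in the question; per the linked docs, workers without public IPs also need Private Google Access enabled on the subnetwork so they can still reach Google APIs and services):

from apache_beam.options.pipeline_options import (
    PipelineOptions, WorkerOptions)

# Option 1: pass the flag alongside the existing ones.
options = PipelineOptions(
    flags=['--requirements_file', './requirements.txt',
           '--no_use_public_ips'])

# Option 2: set the attribute on WorkerOptions directly; in recent Beam
# versions, use_public_ips is the attribute behind the --no_use_public_ips flag.
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False
worker_options.subnetwork = NETWORK  # placeholder from the question

After this change, the worker VMs should come up with internal IPs only, which you can verify in the Compute Engine console (the External IP column stays empty).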