Private IPs to run Google Dataflow (Apache Beam jobs)
We use the Python SDK for Apache Beam in a Google Dataflow environment. The tool is great, but we are concerned about the privacy of these jobs, since it looks like public IPs are used to run the workers. Our questions are:
- Even if we specify a network and subnetwork, do we still have to worry about public IPs being used?
- What exactly is the difference, in terms of performance and security, when public IPs are restricted?
- How do we set up Dataflow so that all workers are created on private IPs? In theory, in the template below, we configured the pipeline so that this behavior should not be allowed (but it still is), per the docs.
Our job template looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER
Note that we set worker_options.subnetwork to our own subnetwork. However, once we run the job, it still looks like the workers are created on public IPs.
In the end, the code runs like this:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
Thanks!
You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
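For reference, a minimal sketch of how that flag could be wired into the template above (PROJECT, NETWORK, etc. are the same placeholders as in the question; per the linked docs, workers without public IPs also need Private Google Access enabled on the subnetwork so they can still reach Google APIs and services):

from apache_beam.options.pipeline_options import (
    PipelineOptions, WorkerOptions)

# Option 1: pass the flag alongside the existing ones.
options = PipelineOptions(
    flags=['--requirements_file', './requirements.txt',
           '--no_use_public_ips'])

# Option 2: set the attribute on WorkerOptions directly; in recent Beam
# versions, use_public_ips is the attribute behind the --no_use_public_ips flag.
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False
worker_options.subnetwork = NETWORK  # placeholder from the question

After this change, the worker VMs should come up with internal IPs only, which you can verify in the Compute Engine console (the External IP column stays empty).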