How to disable public IPs in a predefined template for a Dataflow job launch
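I am trying to deploy a Dataflow job from one of Google's predefined templates using the Python API.
I don't want my Dataflow compute instances to have a public IP, so I am using something like this: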
GCSPATH = "gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text"
BODY = {
    "jobName": "{jobname}".format(jobname=JOBNAME),
    "parameters": {
        "inputTopic": "projects/{project}/topics/{topic}".format(project=PROJECT, topic=TOPIC),
        "outputDirectory": "gs://{bucket}/pubsub-backup-v2/{topic}/".format(bucket=BUCKET, topic=TOPIC),
        "outputFilenamePrefix": "{topic}-".format(topic=TOPIC),
        "outputFilenameSuffix": ".txt"
    },
    "environment": {
        "machineType": "n1-standard-1",
        "usePublicIps": False,
        "subnetwork": SUBNETWORK,
    }
}
request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()
But I get this error:
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://dataflow.googleapis.com/v1b3/projects/ABC/templates:launch?alt=json&gcsPath=gs%3A%2F%2Fdataflow-templates%2Flatest%2FCloud_PubSub_to_GCS_Text returned "Invalid JSON payload received. Unknown name "use_public_ips" at 'launch_parameters.environment': Cannot find field.">
If I remove usePublicIps, the request goes through, but my compute instances are deployed with public IPs.
From reading the documentation, Specifying your Network and Subnetwork on Dataflow, I see that Python uses use_public_ips=false rather than the usePublicIps=false that Java uses. Try changing the parameter.
Also, keep in mind:
When you turn off public IP addresses, the Cloud Dataflow pipeline can
access resources only in the following places:
another instance in the same VPC network
a Shared VPC network
a network with VPC Network Peering enabled
I found a way to make this work.
Run the template with custom parameters:
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/$TOPIC \
--outputDirectory=gs://${BUCKET}/temp/ \
--outputFilenamePrefix=windowed-file \
--outputFilenameSuffix=.txt \
--workerMachineType=n1-standard-1 \
--subnetwork=${SUBNET} \
--usePublicIps=false"
The usePublicIps parameter cannot be overridden at runtime. You need to pass it with the value false to the command that generates the Dataflow template:
mvn compile exec:java -Dexec.mainClass=class -Dexec.args="--project=$PROJECT \
--runner=DataflowRunner --stagingLocation=bucket --templateLocation=bucket \
--usePublicIps=false"
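This adds an ipConfiguration entry to the template's JSON, indicating that the workers must use private IPs only.
The links below are screenshots of the template JSON, with and without the ipConfiguration entry.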
Template with usePublicIps=false
Template without usePublicIps=false
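If you would rather check a staged template file than compare screenshots, a small helper along these lines could do it. This is only a sketch: the function name find_ip_configuration is made up for illustration, and the sample JSON is a hypothetical, heavily trimmed stand-in for a real template file. The search is recursive so that it does not depend on exactly where the entry sits in the file.

```python
import json


def find_ip_configuration(node):
    """Return every value stored under an "ipConfiguration" key, at any depth."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ipConfiguration":
                found.append(value)
            else:
                found.extend(find_ip_configuration(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_ip_configuration(item))
    return found


# Hypothetical template contents, for illustration only; a real template is much larger.
sample = json.loads(
    '{"environment": {"workerPools": [{"ipConfiguration": "WORKER_IP_PRIVATE"}]}}'
)
print(find_ip_configuration(sample))
```

An empty list would suggest the template was generated without usePublicIps=false.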
It looks like you are using the JSON from projects.locations.templates.create.
The environment block, documented here, needs to look like this:
"environment": {
"machineType": "n1-standard-1",
"ipConfiguration": "WORKER_IP_PRIVATE",
"subnetwork": SUBNETWORK // sample: regions/${REGION}/subnetworks/${SUBNET}
}
The value of ipConfiguration is an enum documented under Job.WorkerIPAddressConfiguration.
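Applied to the Python request from the question, that environment block might look like the sketch below. The helper names build_launch_body and launch_private_ip_job are made up for illustration, and the launch call assumes google-api-python-client is installed and default credentials are available. I have not run it against a live project.

```python
def build_launch_body(job_name, project, topic, bucket, subnetwork):
    """Build a templates.launch request body that asks for private worker IPs."""
    return {
        "jobName": job_name,
        "parameters": {
            "inputTopic": "projects/{}/topics/{}".format(project, topic),
            "outputDirectory": "gs://{}/pubsub-backup-v2/{}/".format(bucket, topic),
            "outputFilenamePrefix": "{}-".format(topic),
            "outputFilenameSuffix": ".txt",
        },
        "environment": {
            "machineType": "n1-standard-1",
            # Enum value from Job.WorkerIPAddressConfiguration; used instead of usePublicIps.
            "ipConfiguration": "WORKER_IP_PRIVATE",
            # Expected form: regions/<region>/subnetworks/<subnet>
            "subnetwork": subnetwork,
        },
    }


def launch_private_ip_job(project, gcs_path, body):
    """Send the launch request (assumes google-api-python-client and credentials)."""
    # Imported here so the body builder above can be used without the client library.
    from googleapiclient.discovery import build

    service = build("dataflow", "v1b3")
    request = service.projects().templates().launch(
        projectId=project, gcsPath=gcs_path, body=body
    )
    return request.execute()
```

You would call build_launch_body with your own values and pass the result to launch_private_ip_job together with the same GCSPATH as in the question.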