GCP deploy instance fails from Ansible script
For more than a year I have been deploying clusters in GCP through Ansible scripts, but suddenly one of my scripts keeps giving me this error:
libcloud.common.google.GoogleBaseError: u\"The zone 'projects/[project]/zones/europe-west1-d' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
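The obvious cause would be that I don't have enough resources, but not much has changed and the quotas look fine.
The Ansible script itself doesn't ask for much.
I'm creating 3 n1-standard-4 instances, each with a 100 GB SSD.
See the script snippet below: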
tasks:
  - name: create boot disks
    gce_pd:
      disk_type: pd-ssd
      image: "debian-9-stretch-v20171025"
      name: "{{ item.node }}-disk"
      size_gb: 100
      state: present
      zone: "europe-west1-d"
      service_account_email: "{{ service_account_email }}"
      credentials_file: "{{ credentials_file }}"
      project_id: "{{ project_id }}"
    with_items: "{{nodes}}"
    async: 3600
    poll: 2
  - name: create instances
    gce:
      instance_names: "{{item.node}}"
      zone: "europe-west1-d"
      machine_type: "n1-standard-4"
      preemptible: "{{ false if item.num == '0' else true }}"
      disk_auto_delete: true
      disks:
        - name: "{{ item.node }}-disk"
          mode: READ_WRITE
      state: present
      service_account_email: "{{ service_account_email }}"
      service_account_permissions: "compute-rw"
      credentials_file: "{{ credentials_file }}"
      project_id: "{{ project_id }}"
      tags: "elasticsearch"
    register: gce_raw_results
    with_items: "{{nodes}}"
    async: 3600
    poll: 2
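Update 1:
- The service account is an editor on the whole project, so a permissions problem seems unlikely.
- It started happening on 24 March 2018 and has happened every night since. So if this were an 'out of stock' issue, that would be quite a coincidence, right?
Also, I have been running this script all day so far, and most of the time it fails (see below for a success).
- I tested several times, and it may be related to the 'preemptible' flag on the instances. (I start 3 nodes, but at least the first one has to stay up.) =>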
preemptible: "{{ false if item.num == '0' else true }}"
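If I turn preemptible off (false), it runs smoothly.
The 'workaround' seems to be simply not using preemptible instances, but this used to work for a year without failing once. Has something changed?
Has the GCP API changed? Has the Ansible gce module not implemented those changes?
The full error is: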
TASK [Gathering Facts]
****************************************************************************************************************************************************************************************************************************************************************************************************** ok: [localhost]
TASK [create boot disks]
**************************************************************************************************************************************************************************************************************************************************************************************************** changed: [localhost] => (item={u'node': u'elasticsearch-link-0',
u'ip_field': u'private_ip', u'zone': u'europe-west1-d',
u'cluster_name': u'elasticsearch-link', u'num': u'0', u'machine_type':
u'n1-standard-4', u'project_id': u'[projectid]'}) changed: [localhost]
=> (item={u'node': u'elasticsearch-link-1', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name':
u'elasticsearch-link', u'num': u'1', u'machine_type':
u'n1-standard-4', u'project_id': u'[projectid]'}) ok: [localhost] =>
(item={u'node': u'elasticsearch-link-2', u'ip_field': u'private_ip',
u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link',
u'num': u'2', u'machine_type': u'n1-standard-4', u'project_id':
u'[projectid]'})
TASK [create instances]
***************************************************************************************************************************************************************************************************************************************************************************************************** changed: [localhost] => (item={u'node': u'elasticsearch-link-0',
u'ip_field': u'private_ip', u'zone': u'europe-west1-d',
u'cluster_name': u'elasticsearch-link', u'num': u'0', u'machine_type':
u'n1-standard-4', u'project_id': u'[projectid]'}) changed: [localhost]
=> (item={u'node': u'elasticsearch-link-1', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name':
u'elasticsearch-link', u'num': u'1', u'machine_type':
u'n1-standard-4', u'project_id': u'[projectid]'}) failed: [localhost]
(item={u'node': u'elasticsearch-link-2', u'ip_field': u'private_ip',
u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link',
u'num': u'2', u'machine_type': u'n1-standard-4', u'project_id':
u'[projectid]'}) => {"ansible_job_id": "371957735383.2688",
"changed": false, "cmd":
"/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/gce.py",
"data": "", "failed": 1, "finished": 1, "item": {"cluster_name":
"elasticsearch-link", "ip_field": "private_ip", "machine_type":
"n1-standard-4", "node": "elasticsearch-link-2", "num": "2",
"project_id": "[projectid]", "zone": "europe-west1-d"}, "msg":
"Traceback (most recent call last):\n File
\"/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/async_wrapper.py\",
line 158, in _run_module\n (filtered_outdata, json_warnings) =
_filter_non_json_lines(outdata)\n File \"/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/async_wrapper.py\",
line 99, in _filter_non_json_lines\n raise ValueError('No start of
json char found')\nValueError: No start of json char found\n",
"stderr": "Traceback (most recent call last):\n File
\"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 750, in \n
main()\n File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line
712, in main\n module, gce, inames, number)\n File
\"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 524, in
create_instances\n instance, lc_machine_type, lc_image(),
**gce_args\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\",
line 3874, in create_node\n self.connection.async_request(request,
method='POST', data=node_data)\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 784, in async_request\n response = request(**kwargs)\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\",
line 121, in request\n response = super(GCEConnection,
self).request(*args, **kwargs)\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\",
line 806, in request\n *args, **kwargs)\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 641, in request\n response = responseCls(**kwargs)\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 163, in init\n self.object = self.parse_body()\n File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\",
line 268, in parse_body\n raise GoogleBaseError(message,
self.status, code)\nlibcloud.common.google.GoogleBaseError: u\"The
zone 'projects/[projectid]/zones/europe-west1-d' does not have enough
resources available to fulfill the request. Try a different zone, or
try again later.\"\n", "stderr_lines": ["Traceback (most recent call
last):", " File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line
750, in ", " main()", " File
\"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 712, in main", "
module, gce, inames, number)", " File
\"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 524, in
create_instances", " instance, lc_machine_type, lc_image(),
**gce_args", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\",
line 3874, in create_node", "
self.connection.async_request(request, method='POST',
data=node_data)", " File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 784, in async_request", " response = request(**kwargs)", "
File
\"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\",
line 121, in request", " response = super(GCEConnection,
self).request(*args, **kwargs)", " File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\",
line 806, in request", " *args, **kwargs)", " File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 641, in request", " response = responseCls(**kwargs)", " File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\",
line 163, in init", " self.object = self.parse_body()", " File
\"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\",
line 268, in parse_body", " raise GoogleBaseError(message,
self.status, code)", "libcloud.common.google.GoogleBaseError: u\"The
zone 'projects/[projectid]/zones/europe-west1-d' does not have enough
resources available to fulfill the request. Try a different zone, or
try again later.\""]}
to retry, use: --limit @/usr/local/airflow/ansible/playbooks/elasticsearch-link-cluster-create.retry
The error message does not indicate a quota error but a zone resource issue, so I suggest trying a different zone.
Quoting the documentation:
Even if you have a regional quota, it is possible that a resource might not be available in a specific zone. For example, you might have quota in region us-central1 to create VM instances, but might not be able to create VM instances in the zone us-central1-a if the zone is depleted. In such cases, try creating the same resource in another zone, such as us-central1-f.
So when writing your script you should account for this possibility, even though it is not common.
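One way to account for it is to tell a zone-capacity failure apart from other errors before deciding to retry elsewhere. Below is a minimal sketch in Python, assuming the error text matches the message quoted in the question. The function name `is_zone_exhausted` is hypothetical, and matching on message text is a heuristic rather than a documented contract.

```python
# Heuristic check: does an error message look like zone capacity exhaustion?
# The marker string is taken from the error quoted in the question; matching
# on message text is an assumption, not a documented API contract.
ZONE_EXHAUSTED_MARKER = "does not have enough resources available"


def is_zone_exhausted(message):
    """Return True if the message looks like a zone stock-out error."""
    return ZONE_EXHAUSTED_MARKER in message


if __name__ == "__main__":
    sample = (
        "The zone 'projects/[project]/zones/europe-west1-d' does not have "
        "enough resources available to fulfill the request. "
        "Try a different zone, or try again later."
    )
    print(is_zone_exhausted(sample))
    # A made-up, unrelated message used only for contrast.
    print(is_zone_exhausted("some other error"))
```

A caller could use this check to retry in another zone only when the failure is a capacity problem, and let every other error surface as usual.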
The issue is more pronounced for preemptible instances, because:
Preemptible instances are finite Compute Engine resources, so they might not always be available. [...] these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity so their availability varies with usage.
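Update
To double-check what I am saying, you can keep the preemptible flag and change the zone, to verify that the script itself works and that what happens at night is a stock-out (and since it works during the day, that should be the case).
- If the issue really is availability, you might consider launching preemptible instances and, if they are unavailable, catching the error and then falling back to normal instances or to another zone.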
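The fallback idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `CapacityError` and `create_with_fallback` are hypothetical names, and `create` is a caller-supplied function that is assumed to raise `CapacityError` when a zone cannot fulfil the request. It is not part of Ansible, libcloud, or any GCP client.

```python
class CapacityError(Exception):
    """Hypothetical stand-in for a 'zone does not have enough resources' error."""


def create_with_fallback(create, name, zones):
    """Try a preemptible instance in each zone, then a standard one in each zone.

    `create(name, zone, preemptible)` is supplied by the caller and is assumed
    to raise CapacityError when the zone cannot fulfil the request.
    Returns the (zone, preemptible) pair that succeeded.
    """
    for preemptible in (True, False):
        for zone in zones:
            try:
                create(name, zone, preemptible)
                return zone, preemptible
            except CapacityError:
                continue  # capacity problem only; other errors propagate
    raise CapacityError("no capacity for %s in any of %s" % (name, list(zones)))


if __name__ == "__main__":
    def fake_create(name, zone, preemptible):
        # Simulate preemptible capacity being exhausted in europe-west1-d only.
        if preemptible and zone == "europe-west1-d":
            raise CapacityError(zone)

    print(create_with_fallback(
        fake_create, "elasticsearch-link-2",
        ["europe-west1-d", "europe-west1-b"]))
```

This ordering prefers a preemptible instance in any zone over a standard one; swapping the two loops would instead prefer staying in the first zone and paying for a standard instance there.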
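Update 2
As promised, I created a feature request on your behalf; you can follow updates on the public tracker.
I suggest you star it in order to receive updates by email: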