Ansible AWX RabbitMQ container in Kubernetes: "Failed to get nodes from k8s" with nxdomain
I am trying to install Ansible AWX on my Kubernetes cluster, but the RabbitMQ container keeps throwing a "Failed to get nodes from k8s" error.
Here are the platform versions I am using:
[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5",
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean",
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc",
Platform:"linux/amd64"}
Kubernetes was deployed with the kubespray playbook v2.5.0, and all services and pods are up and running (CoreDNS, Weave, IPTables).
I am deploying AWX 1.0.6, using the 1.0.6 images for awx_web and awx_task.
I am using an external PostgreSQL database at v10.4, and have verified that the tables in the database were created by awx.
Troubleshooting steps I have tried:
- I deployed AWX 1.0.5 with its etcd pod to the same cluster, and it worked as expected.
- I deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as closely as I could, and it worked with the rabbit_peer_discovery_k8s backend.
- I tried tweaking a few rabbitmq.conf settings for AWX 1.0.6 (the relevant section is sketched after this list), but unfortunately it kept throwing the same error.
- I verified that the /etc/resolv.conf file has an entry for kubernetes.default.svc.cluster.local.
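For context, the k8s peer discovery section of rabbitmq.conf typically looks something like the sketch below. It uses the stock rabbit_peer_discovery_k8s keys and is not the exact config shipped with AWX 1.0.6:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# API endpoint the plugin queries for peer pods; this host is also the plugin's built-in default
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 443
# Return pod IPs rather than hostnames, matching the rabbit@10.233.x.x node names in the log below
cluster_formation.k8s.address_type = ip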
Cluster info
[node1 ~]# kubectl get all -n awx
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME READY STATUS RESTARTS AGE
po/awx-654f7fc84c-9ppqb 3/4 CrashLoopBackOff 11 38m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/awx-rmq-mgmt ClusterIP 10.233.10.146 <none> 15672/TCP 1d
svc/awx-web-svc NodePort 10.233.3.75 <none> 80:31700/TCP 1d
svc/rabbitmq NodePort 10.233.37.33 <none> 15672:30434/TCP,5672:31962/TCP 1d
AWX RabbitMQ error log
[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
Starting RabbitMQ 3.7.4 on Erlang 20.1.7
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
## ##
## ## RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
########## Licensed under the MPL. See http://www.rabbitmq.com/
###### ##
########## Logs: <stdout>
Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
node : rabbit@10.233.120.5
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : at619UOZzsenF44tSK3ulA==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering: OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Kubernetes API service
[node1 ~]# kubectl describe service kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.233.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.237.34.19:6443,10.237.34.21:6443
Session Affinity: ClientIP
Events: <none>
nslookup from a busybox pod in the same Kubernetes cluster
[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
Name: kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
Please let me know if I have left out any information that would help with troubleshooting.
I believe the solution is to omit the explicit kubernetes host. I can't think of a good reason one would need to specify the kubernetes api host from inside the cluster itself.
If for some terrible reason the RMQ plugin does require it, try swapping in the Service IP instead (assuming your master's SSL certificate has its Service IP in the SAN list).
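If you do try the Service IP route, the swap in rabbitmq.conf would look roughly like this; a sketch using the stock rabbit_peer_discovery_k8s keys, with 10.233.0.1 taken from the kubectl describe output above:

# Point the plugin at the kubernetes Service's ClusterIP instead of its DNS name
cluster_formation.k8s.host = 10.233.0.1
cluster_formation.k8s.port = 443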
As for why it would do such a silly thing, the only good reason I can think of is that the RMQ PodSpec somehow ended up with a dnsPolicy other than ClusterFirst. If you genuinely want to troubleshoot the RMQ Pod, you can provide an explicit command: that first runs some debugging bash commands to interrogate the state of the container at startup, and then execs /launch.sh to resume launching RMQ (as they do). A sketch of both ideas follows.
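Putting the two suggestions together, a hypothetical fragment of the awx-rabbit container spec might look like the following. The image tag is assumed from the AWX installer defaults, and the debug commands are illustrative only:

spec:
  dnsPolicy: ClusterFirst              # make sure the Pod resolves names via the cluster DNS
  containers:
  - name: awx-rabbit
    image: ansible/awx_rabbitmq:3.7.4  # assumed tag; check your installer's defaults
    command: ["/bin/bash", "-c"]
    args:
    - |
      # Debug: inspect the DNS state the container actually starts with
      cat /etc/resolv.conf
      nslookup kubernetes.default.svc.cluster.local || true
      # Then hand control back to the image's stock entrypoint
      exec /launch.sh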