Temporary failure in name resolution for pod running on kubeadm worker node

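I am running Kafka in a Kubernetes cluster on VMware, with one control plane node and one worker node. From the control plane node my client can communicate with Kafka, but from the worker node it ends up with this error: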

   %3|1638529687.405|FAIL|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap]: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
   %3|1638529687.406|ERROR|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:app]: apollo-prototype-765f4d8bcf-bjpf4#producer-2: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)

This is my Kafka cluster manifest (using Strimzi):

listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    authentication:
      type: scram-sha-512
  - name: external
    port: 9094
    type: ingress
    tls: true
    authentication:
      type: scram-sha-512
    configuration:
      class: nginx
      bootstrap:
        host: localb.kafka.xxx.com
      brokers:
      - broker: 0
        host: local.kafka.xxx.com

It is worth mentioning that the exact same configuration works flawlessly when I run it in the cloud.

Telnet and nslookup (from both nodes) throw errors. The CoreDNS logs do not even mention this error. The firewall is also disabled on both nodes.
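One way to narrow this kind of failure down is to resolve the Service by its fully qualified name from inside the affected pod, and to check which node the CoreDNS pods are running on. The sketch below is only an illustration: the helper name svc_fqdn and the namespace "kafka" are assumptions (the namespace is not stated above), and cluster.local is the kubeadm default cluster domain, which may differ in your setup.

```shell
#!/bin/sh
# svc_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Build the fully qualified in-cluster DNS name of a Service.
# (Hypothetical helper; cluster.local is only the default domain.)
svc_fqdn() {
    printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Example, run from a machine with kubectl access. "kafka" is an assumed
# namespace, and the pod image must ship nslookup for this to work:
#   kubectl exec apollo-prototype-765f4d8bcf-bjpf4 -- \
#       nslookup "$(svc_fqdn my-cluster-kafka-bootstrap kafka)"
#
# Show which node each CoreDNS pod is scheduled on:
#   kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
```

If the lookup works on the control plane node but times out on the worker, the problem is more likely pod-to-pod networking between nodes than CoreDNS itself, which would also explain why the CoreDNS logs stay silent.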

Can you help me? Thanks!


UPDATE: Solution

The Calico pod (on the worker node) was complaining "bird: Netlink: Network is down", even though it was not crashing:

2021-12-03 09:39:58.051 [INFO][90] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.051 [INFO][90] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.052 [INFO][90] felix/ipsets.go 130: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-12-03 09:39:58.057 [INFO][90] felix/ipsets.go 785: Doing full IP set rewrite family="inet" numMembersInPendingReplace=3 setID="this-host"
2021-12-03 09:39:58.059 [INFO][90] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=13 ifaceName="tunl0" state="down"
2021-12-03 09:39:58.082 [INFO][90] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"down", Index:13}
bird: Netlink: Network is down
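In a longer log, the relevant line is easy to miss. A minimal sketch for pulling out the interfaces that Felix reports as down follows; the function name down_ifaces is made up for illustration, and the pod name and namespace in the usage comment are placeholders.

```shell
#!/bin/sh
# down_ifaces: read Felix log lines on stdin and print the name of every
# interface reported with state="down", one per line, de-duplicated.
# (Hypothetical helper name, for illustration only.)
down_ifaces() {
    sed -n 's/.*ifaceName="\([^"]*\)" state="down".*/\1/p' | sort -u
}

# Example (pod name and namespace are placeholders, adjust to your cluster):
#   kubectl logs -n kube-system <calico-node-pod> | down_ifaces
```

On the log excerpt above this would print tunl0, the IP-in-IP tunnel interface.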

Here is what I did, and it worked perfectly!

The fault is caused by the nodes having different ipvs modules loaded. I had configured the ipip module on the new node, but the old node had not loaded the ipip module, which caused the Calico exception. Removing the ipip module returned things to normal.

[root@k8s-node236-232 ~]# lsmod  | grep ipip
ipip                   16384  0 
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
[root@k8s-node236-232 ~]# modprobe -r ipip
[root@k8s-node236-232 ~]# lsmod  | grep ipip
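The same check can be wrapped in a small function if you want to compare several nodes before removing anything. This is only a sketch: the name module_loaded is my own, and it reads lsmod-style output on stdin so it can be tried without touching the kernel.

```shell
#!/bin/sh
# module_loaded NAME: read lsmod-style output on stdin and succeed only
# if a module called NAME appears in the first column.
# (Hypothetical helper name, for illustration only.)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Example on a node (modprobe needs root):
#   if lsmod | module_loaded ipip; then
#       modprobe -r ipip    # same removal step as in the session above
#   fi
```

Matching on the first column avoids a false positive from lines such as tunnel4 or ip_tunnel, which only mention ipip in their "Used by" column and would still match a plain grep.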
