Improper CNI install preventing CoreDNS pods from starting
I just installed a single-master cluster with kubeadm v1.15.0. However, coredns seems to be stuck in Pending:
coredns-5c98db65d4-4pm65 0/1 Pending 0 2m17s <none> <none> <none> <none>
coredns-5c98db65d4-55hcc 0/1 Pending 0 2m2s <none> <none> <none> <none>
Here is what describe shows for the pod:
kubectl describe pods coredns-5c98db65d4-4pm65 --namespace=kube-system
Name: coredns-5c98db65d4-4pm65
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: k8s-app=kube-dns
pod-template-hash=5c98db65d4
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/coredns-5c98db65d4
Containers:
coredns:
Image: k8s.gcr.io/coredns:1.3.1
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-5t2wn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-5t2wn:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-5t2wn
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 61s (x4 over 5m21s) default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
I removed the taints on the master node, to no avail. Shouldn't I be able to bring up a single-node master without issue? I know pods cannot be scheduled on the master unless the taint is removed, but this is still odd.
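For reference, this is the usual way to remove the master taint on a kubeadm cluster of this era (the trailing dash removes the taint):
kubectl taint nodes --all node-role.kubernetes.io/master-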
I tried adding the latest Calico CNI, which also didn't work.
Running journalctl, I get the following (systemctl shows no errors):
sudo journalctl -xn --unit kubelet.service
[sudo] password for gms:
-- Logs begin at Fri 2019-07-12 04:31:34 CDT, end at Tue 2019-07-16 16:58:17 CDT. --
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:54.122355 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:54.400606 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:59.124863 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:59.400924 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:04.127120 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:04.401266 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:09.129287 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:09.401520 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:14.133059 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:14.402008 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Indeed, when I look in /etc/cni/net.d there is nothing there -> yes, I ran kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
...and this was the output when I applied it:
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
The calico-node pod is stuck in the following state, so I ran describe on it:
calico-node-tcfhw 0/1 Init:0/3 0 11m 10.32.3.158
describe pods calico-node-tcfhw --namespace=kube-system
Name: calico-node-tcfhw
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: thalia0.ahc.umn.edu/10.32.3.158
Start Time: Tue, 16 Jul 2019 18:08:25 -0500
Labels: controller-revision-hash=844ddd97c6
k8s-app=calico-node
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP: 10.32.3.158
Controlled By: DaemonSet/calico-node
Init Containers:
upgrade-ipam:
Container ID: docker://1e1bf9e65cb182656f6f06a1bb8291237562f0f5a375e557a454942e81d32063
Image: calico/cni:v3.8.0
Image ID: docker-pullable://docker.io/calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/calico-ipam
-upgrade
State: Running
Started: Tue, 16 Jul 2019 18:08:26 -0500
Ready: False
Restart Count: 0
Environment:
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/lib/cni/networks from host-local-net-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
install-cni:
Container ID:
Image: calico/cni:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CNI_MTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
SLEEP: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
flexvol-driver:
Container ID:
Image: calico/pod2daemon-flexvol:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Containers:
calico-node:
Container ID:
Image: calico/node:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
Liveness: http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
NODENAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: k8s,bgp
IP: autodetect
CALICO_IPV4POOL_IPIP: Always
FELIX_IPINIPMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSCREEN: info
FELIX_HEALTHENABLED: true
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-local-net-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/networks
HostPathType:
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
calico-node-token-b9c6p:
Type: Secret (a volume populated by a Secret)
SecretName: calico-node-token-b9c6p
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: :NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m15s default-scheduler Successfully assigned kube-system/calico-node-tcfhw to thalia0.ahc.umn.edu
Normal Pulled 9m14s kubelet, thalia0.ahc.umn.edu Container image "calico/cni:v3.8.0" already present on machine
Normal Created 9m14s kubelet, thalia0.ahc.umn.edu Created container upgrade-ipam
Normal Started 9m14s kubelet, thalia0.ahc.umn.edu Started container upgrade-ipam
I tried Flannel as the CNI, but that was even worse. kube-proxy wouldn't even start because of the taint!
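For context, a Flannel install at the time typically meant applying its manifest, roughly as below (the coreos URL is my assumption of the then-current location; the repo has since moved to the flannel-io org):
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml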
EDIT Addendum
Should kube-controller-manager and kube-scheduler not have endpoints defined?
[gms@thalia0 ~]$ kubectl get ep --namespace=kube-system -o wide
NAME ENDPOINTS AGE
kube-controller-manager <none> 19h
kube-dns <none> 19h
kube-scheduler <none> 19h
[gms@thalia0 ~]$ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
coredns-5c98db65d4-nmn4g 0/1 Pending 0 19h
coredns-5c98db65d4-qv8fm 0/1 Pending 0 19h
etcd-thalia0.x.x.edu. 1/1 Running 0 19h
kube-apiserver-thalia0.x.x.edu 1/1 Running 0 19h
kube-controller-manager-thalia0.x.x.edu 1/1 Running 0 19h
kube-proxy-4hrdc 1/1 Running 0 19h
kube-proxy-vb594 1/1 Running 0 19h
kube-proxy-zwrst 1/1 Running 0 19h
kube-scheduler-thalia0.x.x.edu 1/1 Running 0 19h
Finally, for sanity's sake, I tried v1.13.1 and, voilà! Success:
NAME READY STATUS RESTARTS AGE
calico-node-pbrps 2/2 Running 0 15s
coredns-86c58d9df4-g5944 1/1 Running 0 2m40s
coredns-86c58d9df4-zntjl 1/1 Running 0 2m40s
etcd-thalia0.ahc.umn.edu 1/1 Running 0 110s
kube-apiserver-thalia0.ahc.umn.edu 1/1 Running 0 105s
kube-controller-manager-thalia0.ahc.umn.edu 1/1 Running 0 103s
kube-proxy-qxh2h 1/1 Running 0 2m39s
kube-scheduler-thalia0.ahc.umn.edu 1/1 Running 0 117s
EDIT 2
Tried sudo kubeadm upgrade plan, but got errors about the health of the API server and bad certificates.
Ran this against the api-server:
kubectl logs kube-apiserver-thalia0.x.x.edu --namespace=kube-system
and got a ton of errors of the type TLS handshake error from 10.x.x.157:52384: remote error: tls: bad certificate, coming from nodes that were removed from the cluster long ago, long after several kubeadm resets on the master and reinstalls of kubelet, kubeadm, etc.
Why are these old nodes showing up? Aren't the certificates recreated on kubeadm init?
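One way to sanity-check which names and IPs the current API server certificate actually covers, assuming the default kubeadm PKI path, is:
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'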
CoreDNS won't start until Calico is up. Check whether your worker nodes are ready with these commands:
kubectl get nodes -owide
kubectl describe node <your-node>
or
kubectl get node <your-node> -oyaml
Another thing to check is the following message in the logs:
"Unable to update cni config: No networks found in /etc/cni/net.d"
What's in that directory? It may be that the CNI configuration is incorrect.
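A quick way to look, from a shell on the node:
ls -la /etc/cni/net.d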
The directory /etc/cni/net.d should contain two files:
10-calico.conflist calico-kubeconfig
Below are the contents of those two files; check whether your directory has files like these.
[root@master net.d]# cat 10-calico.conflist
{
"name": "k8s-pod-network",
"cniVersion": "0.3.0",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "master",
"mtu": 1440,
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "portmap",
"snat": true,
"capabilities": {"portMappings": true}
}
]
}
[root@master net.d]# cat calico-kubeconfig
# Kubeconfig file for Calico CNI plugin.
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
server: https://[10.20.0.1]:443
certificate-authority-data: LSRt.... tLQJ=
users:
- name: calico
user:
token: "eUJh .... ZBoIA"
contexts:
- name: calico-context
context:
cluster: local
user: calico
current-context: calico-context
This issue, https://github.com/projectcalico/calico/issues/2699, has similar symptoms and suggests that deleting /var/lib/cni/ resolved it. You can check whether it exists and, if so, delete it.
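A minimal sketch of that cleanup, in the prompt style used above and assuming a systemd-managed kubelet (restarted so the node re-syncs its network state):
[root@master ~]# ls /var/lib/cni
[root@master ~]# rm -rf /var/lib/cni
[root@master ~]# systemctl restart kubelet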