Failure on kubernetes cluster creation with kops
I am trying to create a very simple cluster on AWS using kops, with one master node and two worker nodes. After the creation, however, kops validate cluster complains that the cluster is not healthy.
The cluster was created with:
kops create cluster --name=mycluster --zones=ap-south-1a --master-size="t2.micro" --node-size="t2.micro" --node-count="2" --cloud aws --ssh-public-key="~/.ssh/id_rsa.pub"
Output from kops validate cluster:
VALIDATION ERRORS
KIND NAME MESSAGE
Pod kube-system/kops-controller-xxxtk system-node-critical pod "kops-controller-xxxtk" is not ready (kops-controller)
Pod kube-system/kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal system-cluster-critical pod "kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal" is not ready (kube-controller-manager)
Validation Failed
Validation failed: cluster not yet healthy
Getting the resources in the kube-system namespace shows:
NAME READY STATUS RESTARTS AGE
pod/dns-controller-8d8889c4b-rwnkd 1/1 Running 0 47m
pod/etcd-manager-events-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/etcd-manager-main-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/kops-controller-xxxtk 1/1 Running 11 70m
pod/kube-apiserver-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 2/2 Running 1 72m
pod/kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 0/1 CrashLoopBackOff 15 72m
pod/kube-dns-696cb84c7-qzqf2 3/3 Running 0 16h
pod/kube-dns-696cb84c7-tt7ng 3/3 Running 0 16h
pod/kube-dns-autoscaler-55f8f75459-7jbjb 1/1 Running 0 16h
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 16h
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 16h
pod/kube-scheduler-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 15 72m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 100.64.0.10 <none> 53/UDP,53/TCP 16h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/kops-controller 1 1 1 1 1 kops.k8s.io/kops-controller-pki=,node-role.kubernetes.io/master= 16h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dns-controller 1/1 1 1 16h
deployment.apps/kube-dns 2/2 2 2 16h
deployment.apps/kube-dns-autoscaler 1/1 1 1 16h
NAME DESIRED CURRENT READY AGE
replicaset.apps/dns-controller-8d8889c4b 1 1 1 16h
replicaset.apps/kube-dns-696cb84c7 2 2 2 16h
replicaset.apps/kube-dns-autoscaler-55f8f75459 1 1 1 16h
Getting the logs from kube-scheduler shows:
I0211 04:26:45.546427 1 flags.go:59] FLAG: --vmodule=""
I0211 04:26:45.546442 1 flags.go:59] FLAG: --write-config-to=""
I0211 04:26:46.306497 1 serving.go:331] Generated self-signed cert in-memory
W0211 04:26:47.736258 1 authentication.go:368] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0211 04:26:47.765649 1 authentication.go:265] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0211 04:26:47.783852 1 authentication.go:289] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0211 04:26:47.798838 1 authorization.go:187] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0211 04:26:47.831825 1 authorization.go:156] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0211 04:26:55.344064 1 factory.go:210] Creating scheduler from algorithm provider 'DefaultProvider'
I0211 04:26:55.370766 1 registry.go:173] Registering SelectorSpread plugin
I0211 04:26:55.370802 1 registry.go:173] Registering SelectorSpread plugin
I0211 04:26:55.504324 1 server.go:146] Starting Kubernetes Scheduler version v1.19.7
W0211 04:26:55.607516 1 authorization.go:47] Authorization is disabled
W0211 04:26:55.607537 1 authentication.go:40] Authentication is disabled
I0211 04:26:55.618714 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0211 04:26:55.741863 1 tlsconfig.go:200] loaded serving cert ["Generated self signed cert"]: "localhost@1613017606" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer="localhost-ca@1613017605" (2021-02-11 03:26:45 +0000 UTC to 2022-02-11 03:26:45 +0000 UTC (now=2021-02-11 04:26:55.741788572 +0000 UTC))
I0211 04:26:55.746888 1 named_certificates.go:53] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1613017607" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1613017607" (2021-02-11 03:26:46 +0000 UTC to 2022-02-11 03:26:46 +0000 UTC (now=2021-02-11 04:26:55.7468713 +0000 UTC))
I0211 04:26:55.757881 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0211 04:26:55.771581 1 secure_serving.go:197] Serving securely on [::]:10259
I0211 04:26:55.793134 1 reflector.go:207] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.815641 1 reflector.go:207] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.841309 1 reflector.go:207] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.857460 1 reflector.go:207] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.875096 1 reflector.go:207] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.894283 1 reflector.go:207] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.894615 1 reflector.go:207] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.895000 1 reflector.go:207] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.895250 1 reflector.go:207] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.902323 1 reflector.go:207] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.902572 1 reflector.go:207] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.905927 1 reflector.go:207] Starting reflector *v1.Pod (0s) from k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:188
I0211 04:26:56.355570 1 node_tree.go:86] Added node "ip-172-20-43-190.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.357441 1 node_tree.go:86] Added node "ip-172-20-63-116.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.357578 1 node_tree.go:86] Added node "ip-172-20-60-103.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.377402 1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
I0211 04:27:12.368681 1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
I0211 04:27:12.436915 1 scheduler.go:597] "Successfully bound pod to node" pod="default/nginx-deployment-66b6c48dd5-w4hb5" node="ip-172-20-63-116.ap-south-1.compute.internal" evaluatedNodes=3 feasibleNodes=2
I0211 04:27:12.451792 1 scheduler.go:597] "Successfully bound pod to node" pod="default/nginx-deployment-66b6c48dd5-4xz8l" node="ip-172-20-43-190.ap-south-1.compute.internal" evaluatedNodes=3 feasibleNodes=2
E0211 04:32:20.487059 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
I0211 04:32:20.633059 1 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0211 04:32:20.673521 1 server.go:199] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc0005c2d01, 0xc000900800, 0x41, 0x1fd)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
....
... stack trace from go runtime
I don't see anything particularly wrong with the command you ran. However, t2.micro is very small and may well be too small for a cluster.
You can look at the kops-controller logs to see why it is not starting. Try kubectl logs -n kube-system kops-controller-xxxx
and kubectl describe pod -n kube-system kops-controller-xxx
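In case it helps, here is a sketch of the checks I would run, assuming the redacted pod and node names from the output above (substitute your own). These are plain kubectl commands, nothing kops-specific:

# kops-controller lives in kube-system, so the namespace flag is needed
kubectl logs -n kube-system kops-controller-xxxtk
kubectl describe pod -n kube-system kops-controller-xxxtk

# the other unhealthy pod from the validation output; --previous shows the
# log of the container instance that just crashed
kubectl logs -n kube-system --previous kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal

# recent events often reveal OOM kills, failed probes or evictions
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp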
You know, after @Markus's comment and yours I started digging deeper, and here is what I found.
First, the article Running Kubernetes on AWS T2 Instances. It walks through an example on t2.medium, with very detailed steps and a timeline of what happens there.
Its final conclusion:
We’ve shown that the unpredictable nature of deployments on Kubernetes clusters isn’t a good fit for the T2/3 family of instances. There is the potential to have instances throttled due to pods consuming vast amounts of resources. At best this will limit the performance of your applications and at worst could cause the cluster to fail (if using T2/3s for master nodes) due to ETCD issues. Furthermore, this condition will only be picked up if we are monitoring CloudWatch carefully or performing application performance monitoring on the pods.
To this end it is advisable to avoid using T2/3 instance type families for Kubernetes deployments, if you would like to save money whilst using more traditional instance families (such as Ms and Rs) then take a look at our blog on spot instances.
Alongside that, the official figures:
1) t2.micro specs: a t2.micro has 1 vCPU and 1 GB of memory.
2) The general minimum memory and CPU (cores) Kubernetes needs:
The master node needs at least 2 GB of memory and a worker node at least 1 GB.
The master node needs at least 1.5 cores and a worker node at least 0.7 cores.
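To see how far a t2.micro falls short of that, you can compare those minimums with what your nodes actually advertise. A quick check (the node name is the redacted one from your listing, so use your own):

# allocatable CPU and memory per node, as Kubernetes sees it
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

# how much of that is already claimed on the master
kubectl describe node ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal | grep -A 8 "Allocated resources"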
There simply aren't enough resources, in particular for the master.
Please use at least a t2.medium.
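If you would rather resize the existing cluster than recreate it, one possible way is to change the machineType of the kops instance groups and roll the change out. This is only a sketch: the instance group names below are the kops defaults for this zone (check yours with kops get ig), and it assumes your KOPS_STATE_STORE is already set.

# list the instance groups kops created
kops get ig --name mycluster

# change machineType to t2.medium in each (opens an editor)
kops edit ig master-ap-south-1a --name mycluster
kops edit ig nodes-ap-south-1a --name mycluster

# apply the change and replace the running instances
kops update cluster --name mycluster --yes
kops rolling-update cluster --name mycluster --yes

Alternatively, delete the cluster and recreate it with --master-size=t2.medium --node-size=t2.medium.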