Failure on kubernetes cluster creation with kops
I am trying to create a very simple cluster on AWS using kops, with one master node and two worker nodes. After the creation, however, kops validate cluster complains that the cluster is not healthy.
The cluster was created with:
kops create cluster --name=mycluster --zones=ap-south-1a --master-size="t2.micro" --node-size="t2.micro" --node-count="2" --cloud aws --ssh-public-key="~/.ssh/id_rsa.pub"
Output from kops validate cluster:
VALIDATION ERRORS
KIND NAME MESSAGE
Pod kube-system/kops-controller-xxxtk system-node-critical pod "kops-controller-xxxtk" is not ready (kops-controller)
Pod kube-system/kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal system-cluster-critical pod "kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal" is not ready (kube-controller-manager)
Validation Failed
Validation failed: cluster not yet healthy
Getting the resources in the kube-system namespace shows:
NAME READY STATUS RESTARTS AGE
pod/dns-controller-8d8889c4b-rwnkd 1/1 Running 0 47m
pod/etcd-manager-events-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/etcd-manager-main-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/kops-controller-xxxtk 1/1 Running 11 70m
pod/kube-apiserver-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 2/2 Running 1 72m
pod/kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 0/1 CrashLoopBackOff 15 72m
pod/kube-dns-696cb84c7-qzqf2 3/3 Running 0 16h
pod/kube-dns-696cb84c7-tt7ng 3/3 Running 0 16h
pod/kube-dns-autoscaler-55f8f75459-7jbjb 1/1 Running 0 16h
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 16h
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 72m
pod/kube-proxy-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 0 16h
pod/kube-scheduler-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal 1/1 Running 15 72m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 100.64.0.10 <none> 53/UDP,53/TCP 16h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/kops-controller 1 1 1 1 1 kops.k8s.io/kops-controller-pki=,node-role.kubernetes.io/master= 16h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dns-controller 1/1 1 1 16h
deployment.apps/kube-dns 2/2 2 2 16h
deployment.apps/kube-dns-autoscaler 1/1 1 1 16h
NAME DESIRED CURRENT READY AGE
replicaset.apps/dns-controller-8d8889c4b 1 1 1 16h
replicaset.apps/kube-dns-696cb84c7 2 2 2 16h
replicaset.apps/kube-dns-autoscaler-55f8f75459 1 1 1 16h
Getting the logs from kube-scheduler shows:
I0211 04:26:45.546427 1 flags.go:59] FLAG: --vmodule=""
I0211 04:26:45.546442 1 flags.go:59] FLAG: --write-config-to=""
I0211 04:26:46.306497 1 serving.go:331] Generated self-signed cert in-memory
W0211 04:26:47.736258 1 authentication.go:368] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0211 04:26:47.765649 1 authentication.go:265] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0211 04:26:47.783852 1 authentication.go:289] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0211 04:26:47.798838 1 authorization.go:187] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0211 04:26:47.831825 1 authorization.go:156] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0211 04:26:55.344064 1 factory.go:210] Creating scheduler from algorithm provider 'DefaultProvider'
I0211 04:26:55.370766 1 registry.go:173] Registering SelectorSpread plugin
I0211 04:26:55.370802 1 registry.go:173] Registering SelectorSpread plugin
I0211 04:26:55.504324 1 server.go:146] Starting Kubernetes Scheduler version v1.19.7
W0211 04:26:55.607516 1 authorization.go:47] Authorization is disabled
W0211 04:26:55.607537 1 authentication.go:40] Authentication is disabled
I0211 04:26:55.618714 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0211 04:26:55.741863 1 tlsconfig.go:200] loaded serving cert ["Generated self signed cert"]: "localhost@1613017606" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer="localhost-ca@1613017605" (2021-02-11 03:26:45 +0000 UTC to 2022-02-11 03:26:45 +0000 UTC (now=2021-02-11 04:26:55.741788572 +0000 UTC))
I0211 04:26:55.746888 1 named_certificates.go:53] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1613017607" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1613017607" (2021-02-11 03:26:46 +0000 UTC to 2022-02-11 03:26:46 +0000 UTC (now=2021-02-11 04:26:55.7468713 +0000 UTC))
I0211 04:26:55.757881 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0211 04:26:55.771581 1 secure_serving.go:197] Serving securely on [::]:10259
I0211 04:26:55.793134 1 reflector.go:207] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.815641 1 reflector.go:207] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.841309 1 reflector.go:207] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.857460 1 reflector.go:207] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.875096 1 reflector.go:207] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.894283 1 reflector.go:207] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.894615 1 reflector.go:207] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.895000 1 reflector.go:207] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.895250 1 reflector.go:207] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.902323 1 reflector.go:207] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.902572 1 reflector.go:207] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I0211 04:26:55.905927 1 reflector.go:207] Starting reflector *v1.Pod (0s) from k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:188
I0211 04:26:56.355570 1 node_tree.go:86] Added node "ip-172-20-43-190.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.357441 1 node_tree.go:86] Added node "ip-172-20-63-116.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.357578 1 node_tree.go:86] Added node "ip-172-20-60-103.ap-south-1.compute.internal" in group "ap-south-1:\x00:ap-south-1a" to NodeTree
I0211 04:26:56.377402 1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
I0211 04:27:12.368681 1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
I0211 04:27:12.436915 1 scheduler.go:597] "Successfully bound pod to node" pod="default/nginx-deployment-66b6c48dd5-w4hb5" node="ip-172-20-63-116.ap-south-1.compute.internal" evaluatedNodes=3 feasibleNodes=2
I0211 04:27:12.451792 1 scheduler.go:597] "Successfully bound pod to node" pod="default/nginx-deployment-66b6c48dd5-4xz8l" node="ip-172-20-43-190.ap-south-1.compute.internal" evaluatedNodes=3 feasibleNodes=2
E0211 04:32:20.487059 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
I0211 04:32:20.633059 1 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0211 04:32:20.673521 1 server.go:199] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc0005c2d01, 0xc000900800, 0x41, 0x1fd)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
....
... stack trace from go runtime
I don't see anything particularly wrong with the command you ran. However, t2.micro is very small and may well be too small for a cluster.
You can look at the kops-controller logs to see why it is not starting. Try kubectl logs -n kube-system kops-controller-xxxx
and kubectl describe pod -n kube-system kops-controller-xxx
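In case it helps, here is a sketch of the checks I would run, assuming the redacted pod and node names from the output above (substitute your own). These are plain kubectl commands, nothing kops-specific:

# kops-controller lives in kube-system, so the namespace flag is needed
kubectl logs -n kube-system kops-controller-xxxtk
kubectl describe pod -n kube-system kops-controller-xxxtk

# the other unhealthy pod from the validation output; --previous shows the
# log of the container instance that just crashed
kubectl logs -n kube-system --previous kube-controller-manager-ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal

# recent events often reveal OOM kills, failed probes or evictions
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp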
You know, after @Markus's comment and yours I started digging deeper, and here is what I found.
First, the article Running Kubernetes on AWS T2 Instances. It walks through an example on t2.medium, with very detailed steps and a timeline of what happens there.
Its final conclusion:
We’ve shown that the unpredictable nature of deployments on Kubernetes clusters isn’t a good fit for the T2/3 family of instances. There is the potential to have instances throttled due to pods consuming vast amounts of resources. At best this will limit the performance of your applications and at worst could cause the cluster to fail (if using T2/3s for master nodes) due to ETCD issues. Furthermore, this condition will only be picked up if we are monitoring CloudWatch carefully or performing application performance monitoring on the pods.
To this end it is advisable to avoid using T2/3 instance type families for Kubernetes deployments, if you would like to save money whilst using more traditional instance families (such as Ms and Rs) then take a look at our blog on spot instances.
Alongside that, the official figures:
1) t2.micro specs: a t2.micro has 1 vCPU and 1 GB of memory.
2) The general minimum memory and CPU (cores) Kubernetes needs:
The master node needs at least 2 GB of memory and a worker node at least 1 GB.
The master node needs at least 1.5 cores and a worker node at least 0.7 cores.
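To see how far a t2.micro falls short of that, you can compare those minimums with what your nodes actually advertise. A quick check (the node name is the redacted one from your listing, so use your own):

# allocatable CPU and memory per node, as Kubernetes sees it
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

# how much of that is already claimed on the master
kubectl describe node ip-xxx-xxx-xxx-xxx.ap-south-1.compute.internal | grep -A 8 "Allocated resources"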
There simply aren't enough resources, in particular for the master.
Please use at least a t2.medium.
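If you would rather resize the existing cluster than recreate it, one possible way is to change the machineType of the kops instance groups and roll the change out. This is only a sketch: the instance group names below are the kops defaults for this zone (check yours with kops get ig), and it assumes your KOPS_STATE_STORE is already set.

# list the instance groups kops created
kops get ig --name mycluster

# change machineType to t2.medium in each (opens an editor)
kops edit ig master-ap-south-1a --name mycluster
kops edit ig nodes-ap-south-1a --name mycluster

# apply the change and replace the running instances
kops update cluster --name mycluster --yes
kops rolling-update cluster --name mycluster --yes

Alternatively, delete the cluster and recreate it with --master-size=t2.medium --node-size=t2.medium.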