"server doesn't have a resource type "pods"" while installing NVIDIA Clara Deploy

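I am trying to install the latest version of NVIDIA Clara Deploy Bootstrap by following the official documentation (this and this). One of the installation steps is a shell script named "bootstrap.sh", which installs all the dependencies, including Kubernetes and kubectl, and creates the cluster. But after running sudo ./bootstrap.sh, I get this error: error: the server doesn't have a resource type "pods".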

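What I have done so far: I am new to Kubernetes, so I tried the solution from this answer. I ran kubectl get pods, which gave me No resources found. I also tried kubectl auth can-i get pods, which gave me yes. The directory /etc/kubernetes/manifests is empty, although according to that answer it should contain conf files, so I ran sudo kubeadm init.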

Here is the full error message:

2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
 
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]          
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]               
Hit:5 http://deb.debian.org/debian stretch Release 
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]                                       
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]                            
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease             
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]                                       
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]                                                 
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease                                   
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64  InRelease            
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64  InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64  InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: Disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"

1. Instance:

GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4

2. Download the bootstrap archive and unzip it:

$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$unzip bootstrap.zip -d bootstrap

3. Install CUDA as a prerequisite and reboot:

$wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$sudo apt-get update
$sudo apt-get -y install cuda
$sudo reboot

4. After the reboot, enable IP forwarding:

$sudo -s
#echo 1 > /proc/sys/net/ipv4/ip_forward
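
Writing to /proc only lasts until the next reboot, so it can help to check the setting before each run of the script. A minimal sketch, assuming a POSIX shell; the function name ip_forward_enabled is illustrative and not part of bootstrap.sh, and the optional path argument exists only so the check can be tried against a scratch file:

```shell
# Return success when IPv4 forwarding is switched on.
# $1 (optional): file to read; defaults to the real procfs entry.
ip_forward_enabled() {
  flag_file="${1:-/proc/sys/net/ipv4/ip_forward}"
  [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}
```

A possible use is ip_forward_enabled || sudo sysctl -w net.ipv4.ip_forward=1. To keep the setting across reboots, the usual approach is a line net.ipv4.ip_forward = 1 in a file under /etc/sysctl.d/.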

5. Run bootstrap.sh (first time).

kubelet.service shows a code=exited, status=255 error:

$sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
         Docs: https://kubernetes.io/docs/home/
      Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
     Main PID: 2356 (code=exited, status=255)

This error means that you need to run kubeadm init manually. Run kubeadm init --pod-network-cidr=10.244.0.0/16 and then check sudo service kubelet status again to make sure kubelet is running as expected. All Kubernetes configs are generated for you during kubeadm init --pod-network-cidr=10.244.0.0/16.

6. We add --pod-network-cidr=10.244.0.0/16 because we are going to use the Flannel CNI. You can see the same thing in bootstrap.sh, line 334: if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then

$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!.
...
$ sudo service kubelet status
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
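
The kubeadm output above names the files it writes, so their presence is a quick hint as to whether the cluster has been initialized at all. A rough sketch under that assumption; cluster_initialized is a hypothetical helper name, and the optional root prefix is there only so the check can be pointed at a test directory:

```shell
# Return success when the files that kubeadm init reported writing exist.
# $1 (optional): root prefix, e.g. a scratch directory used for testing.
cluster_initialized() {
  root="${1:-}"
  [ -f "$root/etc/kubernetes/admin.conf" ] &&
    [ -f "$root/var/lib/kubelet/config.yaml" ]
}
```

For example: cluster_initialized || sudo kubeadm init --pod-network-cidr=10.244.0.0/16. Depending on directory permissions on your machine, the check itself may need to be run with sudo.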

7. Next comes the usual step that lets your regular user, rather than only root, run kubectl commands

$mkdir -p $HOME/.kube
$sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$sudo chown $(id -u):$(id -g) $HOME/.kube/config
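
kubectl reads the file named by the KUBECONFIG environment variable if it is set, and otherwise falls back to $HOME/.kube/config, which is why the copy above matters. A small sketch for printing which file would be picked up; kubeconfig_in_use is an illustrative name, and the sketch ignores the --kubeconfig command-line flag:

```shell
# Print the kubeconfig path kubectl would consult first.
kubeconfig_in_use() {
  if [ -n "${KUBECONFIG:-}" ]; then
    # KUBECONFIG may be a colon-separated list; report the first entry.
    printf '%s\n' "${KUBECONFIG%%:*}"
  else
    printf '%s\n' "$HOME/.kube/config"
  fi
}
```

Running it once as your user and once under sudo shows whether both invocations look at the same file.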

8. Show everything that is currently installed

$ kubectl get all -A
NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        0/1     Pending   0          4m17s
kube-system   pod/coredns-5c98db65d4-kgzg8        0/1     Pending   0          4m17s
kube-system   pod/etcd-clara                      1/1     Running   0          3m10s
kube-system   pod/kube-apiserver-clara            1/1     Running   0          3m35s
kube-system   pod/kube-controller-manager-clara   1/1     Running   0          3m17s
kube-system   pod/kube-proxy-8qx4z                1/1     Running   0          4m18s
kube-system   pod/kube-scheduler-clara            1/1     Running   0          3m23s
    
    
NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  4m35s
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   4m34s
    
NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           beta.kubernetes.io/os=linux   4m33s
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

Note: at this point the coredns pods are in the Pending state. You can also see that the coredns deployment and replicaset are not ready

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

They are waiting for you to apply the flannel configuration yaml. These are the lines from the same script

info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml

If you skip this now and rerun the script, you will get a timeout error

2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
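
To see which pods are holding things up before rerunning the script, the listing can be filtered by status. A minimal sketch that reads the output of kubectl get pods -A --no-headers on standard input; pods_not_running is a made-up helper name, and the column positions assume the NAMESPACE, NAME, READY, STATUS layout shown above:

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace/name status" for every pod whose STATUS is not Running.
pods_not_running() {
  awk 'NF >= 4 && $4 != "Running" { print $1 "/" $2, $4 }'
}
```

Usage: kubectl get pods -A --no-headers | pods_not_running. An empty result means every listed pod reports Running; note that pods that finished normally (status Completed) would also be printed by this filter.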

9. Deploy Flannel

$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
    
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
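
The pod network CIDR passed to kubeadm init has to match the network Flannel is configured with. Flannel manifests normally carry that value as a "Network" entry in their net-conf.json section; whether the kube-flannel.yml bundled with the bootstrap does the same is an assumption here. A sketch for reading it out, with flannel_network as an illustrative name:

```shell
# Print the first "Network" value found in a Flannel manifest.
# $1: path to the manifest file.
flannel_network() {
  sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" |
    head -n 1
}
```

For example, flannel_network bootstrap/kube-flannel.yml should print the same CIDR that was given to --pod-network-cidr if the two are consistent.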

After that, everything related to coredns starts working right away. The pods are created and reach the Running state, and the deployment and replicaset end up in the correct state.

NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        1/1     Running   0          21m
kube-system   pod/coredns-5c98db65d4-kgzg8        1/1     Running   0          21m
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   2/2     2            2           21m
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         2       21m

In addition, you will see new flannel-related pods and daemonsets

kube-system   pod/kube-flannel-ds-amd64-64jbv     1/1     Running   0          3m59s
    
    
NAMESPACE     NAME                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
kube-system   daemonset.apps/kube-flannel-ds-amd64     1         1         1       1            1           beta.kubernetes.io/arch=amd64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm       0         0         0       0            0           beta.kubernetes.io/arch=arm       3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm64     0         0         0       0            0           beta.kubernetes.io/arch=arm64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-ppc64le   0         0         0       0            0           beta.kubernetes.io/arch=ppc64le   3m59s
kube-system   daemonset.apps/kube-flannel-ds-s390x     0         0         0       0            0           beta.kubernetes.io/arch=s390x     3m59s

10. Finally, it is time to continue running the script. It will try(!) to install helm/tiller and restart dockerd. Everything goes fine except TILLER...

$sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...

11. We have no Tiller pod. As a result, the deployment and replicaset are broken as well...

kube-system   deployment.apps/tiller-deploy   0/1  0 0 7m26s
kube-system   replicaset.apps/tiller-deploy-659c6788f5   1 0 0 7m26s

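I see no other solution here than to manually delete tiller's related components (deployment, service) and reinstall them from scratch, with a few small workarounds.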

# delete tiller (deployment and service)
$kubectl delete deployment tiller-deploy -n kube-system
$kubectl delete service tiller-deploy -n kube-system
    
#install helm,tiller
$curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$kubectl create serviceaccount --namespace kube-system tiller
$kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$helm init --service-account tiller

Now, if you check what has been deployed, you will clearly see that the tiller pod is in the Pending state and that the tiller-deploy deployment is not ready either

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   0/1     Pending   0          11m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   0/1     1            0           11m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         0       11m
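
The READY column in these listings has the form ready/desired, so a deployment is healthy when both numbers match and are non-zero. A small sketch of that comparison; ready_field_ok is a hypothetical helper name:

```shell
# Return success when a READY value such as "1/1" shows that every
# desired replica is ready; "0/1" or "0/0" returns failure.
ready_field_ok() {
  have="${1%%/*}"
  want="${1##*/}"
  [ "$want" -gt 0 ] 2>/dev/null && [ "$have" -eq "$want" ] 2>/dev/null
}
```

One way to feed it: ready_field_ok "$(kubectl get deployment tiller-deploy -n kube-system --no-headers | awk '{print $2}')". The built-in kubectl rollout status deployment/tiller-deploy -n kube-system gives similar information without any parsing.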

12. Fixing Tiller

Let's describe the tiller pod and look at its tolerations

$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s

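I won't explain why in detail (you can read about taints and tolerations yourself). In short, kubeadm taints the master node with node-role.kubernetes.io/master:NoSchedule, the tiller pod does not tolerate that taint, and on a single-node cluster there is no other node to schedule it on. The fix is to allow the master to run pods...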

$kubectl taint nodes --all node-role.kubernetes.io/master-

After that you will see

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   1/1     Running   0          13m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   1/1     1            1           13m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         1       13m
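
To check whether the master taint is still present (for instance before deciding whether the kubectl taint command is needed), the Taints line of kubectl describe nodes can be searched. A minimal sketch reading that output on standard input; the helper name is made up for illustration:

```shell
# Read `kubectl describe nodes` output on stdin and return success
# when the master NoSchedule taint appears in it.
has_master_noschedule_taint() {
  grep -qF 'node-role.kubernetes.io/master:NoSchedule'
}
```

Usage: kubectl describe nodes | has_master_noschedule_taint && echo "master is still tainted".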

13. Next, install all the components:

$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
    
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara"successfully created
    
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
    
$ clara platform start
Starting clara...
NAME:   clara
    
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
    
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
    
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
    
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
    
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console

14. To verify that the installation succeeded, run the following commands:

$ helm ls
NAME                    REVISION        UPDATED                         STATUS          CHART                                   APP VERSION     NAMESPACE
clara                   1               Mon Oct 19 16:16:36 2020        DEPLOYED        clara-0.7.1-2008.1                      1.0             default  
clara-console           1               Mon Oct 19 16:28:30 2020        DEPLOYED        clara-console-0.7.1-2008.1              1.0             default  
clara-dicom-adapter     1               Mon Oct 19 16:22:36 2020        DEPLOYED        dicom-adapter-0.7.1-2008.1              1.0             default  
clara-monitor-server    1               Mon Oct 19 16:26:35 2020        DEPLOYED        clara-monitor-server-0.7.1-2008.1       1.0             default  
clara-render-server     1               Mon Oct 19 16:22:54 2020        DEPLOYED        clara-renderer-0.7.1-2008.1             1.0             default  
    
    
$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6         1/1     Running   0          13m
clara-console-8565b4d565-wcbg5                         2/2     Running   0          2m2s
clara-console-mongodb-85f8bd5f95-ts2gp                 1/1     Running   0          2m2s
clara-dicom-adapter-7948fcd445-mnsjd                   1/1     Running   0          7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq       1/1     Running   0          3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8          1/1     Running   0          3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq   1/1     Running   0          3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv     3/3     Running   0          7m38s
clara-resultsservice-664477898f-9nk4f                  1/1     Running   0          13m
clara-ui-6f89b97df8-792f6                              1/1     Running   0          13m
clara-workflow-controller-69cbb55fc8-zjhdm             1/1     Running   0          13m
elasticsearch-master-0                                 1/1     Running   0          3m57s
elasticsearch-master-1                                 1/1     Running   0          3m57s
fluentd-km8nj                                          1/1     Running   0          13m
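
The helm ls check can also be scripted by looking for releases whose line does not say DEPLOYED. A rough sketch that reads helm ls output (header included) on standard input; not_deployed_releases is an illustrative name, and the match is a plain text search, so a release or chart with "DEPLOYED" in its own name would confuse it:

```shell
# Read `helm ls` output on stdin, skip the header line, and print the
# name of every release whose line does not contain DEPLOYED.
not_deployed_releases() {
  awk 'NR > 1 && NF > 0 && $0 !~ /DEPLOYED/ { print $1 }'
}
```

Usage: helm ls --all | not_deployed_releases. No output means every listed release reports DEPLOYED.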

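P.S. Of course, it would have been much easier to just fix the script for you, but I decided to show you what happens behind the scenes. I am sure you can do that yourself if needed.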