"server doesn't have a resource type "pods"" while installing NVIDIA Clara Deploy
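I am trying to install the latest version of NVIDIA Clara Deploy Bootstrap by following the official documentation (this & this). One of the installation steps is a shell script named "bootstrap.sh", which installs all the dependencies, including Kubernetes and kubectl, and creates the cluster. But after running sudo ./bootstrap.sh I get this error: error: the server doesn't have a resource type "pods".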
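What I have done so far:
I am new to Kubernetes, so I tried the solution from this answer. I ran kubectl get pods, which gave me No resources found. I also tried kubectl auth can-i get pods, which gave me yes. The /etc/kubernetes/manifests directory is empty, although according to that answer it should contain the conf files, so I ran sudo kubeadm init.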
Here is the full error message:
2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]
Hit:5 http://deb.debian.org/debian stretch Release
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64 InRelease
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64 InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64 InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: Disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"
1. Instance:
GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4
2. Download bootstrap and unzip it:
$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$unzip bootstrap.zip -d bootstrap
3. Install CUDA as a prerequisite and reboot:
$wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$sudo apt-get update
$sudo apt-get -y install cuda
$sudo reboot
4. After the reboot, enable IP forwarding:
$sudo -s
#echo 1 > /proc/sys/net/ipv4/ip_forward
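Writing to /proc/sys/net/ipv4/ip_forward only lasts until the next reboot. A minimal sketch of one way to make the setting persistent, assuming the usual /etc/sysctl.d directory is in use; the file name 99-ip-forward.conf and the helper name are illustrative and not part of the Clara bootstrap:

```shell
#!/bin/sh
# Hypothetical helper (not part of bootstrap.sh): write a sysctl drop-in
# that keeps IPv4 forwarding enabled across reboots.
# $1 is the target file; the default path is an assumed, illustrative name.
write_ip_forward_conf() {
    conf_file="${1:-/etc/sysctl.d/99-ip-forward.conf}"
    printf 'net.ipv4.ip_forward = 1\n' > "$conf_file"
}

# Typical use, as root:
#   write_ip_forward_conf
#   sysctl --system                    # reload all sysctl configuration files
#   cat /proc/sys/net/ipv4/ip_forward  # should print 1
```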
5. Run bootstrap.sh (for the first time). kubelet.service shows a code=exited, status=255 error:
$sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
Docs: https://kubernetes.io/docs/home/
Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 2356 (code=exited, status=255)
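This error means that you should run kubeadm init manually. So run kubeadm init --pod-network-cidr=10.244.0.0/16 and then check sudo service kubelet status again to make sure it is running as expected. All the Kubernetes configuration will be generated for you during kubeadm init --pod-network-cidr=10.244.0.0/16.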
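One way to tell whether kubeadm init has already run on a node is to look for the files it generates. A rough sketch, assuming the default kubeadm layout under /etc/kubernetes; the helper name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the files that kubeadm init normally
# generates are present. $1 is the Kubernetes config directory
# (kubeadm's default is /etc/kubernetes).
control_plane_initialized() {
    k8s_dir="${1:-/etc/kubernetes}"
    [ -f "$k8s_dir/admin.conf" ] &&
        [ -f "$k8s_dir/manifests/kube-apiserver.yaml" ]
}

# Example (run with sudo if the directories are not readable by your user):
#   control_plane_initialized || echo "kubeadm init has not been run yet"
```

An empty manifests directory, as described in the question, makes this check fail, which is consistent with a node that was never initialized.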
6. We add --pod-network-cidr=10.244.0.0/16 because we will be using the Flannel CNI. You can see the same thing in bootstrap.sh, line 334: if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!.
...
$ sudo service kubelet status
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
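The CIDR passed to kubeadm init has to match the network Flannel is configured with; in the upstream kube-flannel.yml that value is the "Network" field of net-conf.json. A small sketch for reading it out of a manifest, assuming the bundled file follows the upstream layout and the usual "Network": "..." formatting (the helper name is illustrative):

```shell
#!/bin/sh
# Hypothetical helper: print the pod network CIDR configured in a
# Flannel manifest (the "Network" value inside net-conf.json).
# $1 is the path to the manifest.
flannel_network() {
    sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Example:
#   flannel_network bootstrap/kube-flannel.yml
```

If the printed value differs from the one given to --pod-network-cidr, one of the two would need to be changed so that they agree.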
7. Next come the usual steps that let your regular user, rather than root, run kubectl commands
$mkdir -p $HOME/.kube
$sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$sudo chown $(id -u):$(id -g) $HOME/.kube/config
8. Show everything that is currently installed
$ kubectl get all -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 0/1 Pending 0 4m17s
kube-system pod/coredns-5c98db65d4-kgzg8 0/1 Pending 0 4m17s
kube-system pod/etcd-clara 1/1 Running 0 3m10s
kube-system pod/kube-apiserver-clara 1/1 Running 0 3m35s
kube-system pod/kube-controller-manager-clara 1/1 Running 0 3m17s
kube-system pod/kube-proxy-8qx4z 1/1 Running 0 4m18s
kube-system pod/kube-scheduler-clara 1/1 Running 0 3m23s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4m35s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 4m34s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-proxy 1 1 1 1 1 beta.kubernetes.io/os=linux 4m33s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
Please note: the coredns pods are currently in the Pending state. You can also see that the coredns deployment and replicaset are not ready
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
They are waiting for you to apply the flannel configuration YAML.
These are the lines from the same script
info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml
If you do not do this at this point and rerun the script, you will get a timeout error
2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
9. Deploy Flannel
$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
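After that, everything related to coredns starts working immediately. The pods will be created and reach the Running state, and the deployment and replicaset will be in the correct state.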
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 1/1 Running 0 21m
kube-system pod/coredns-5c98db65d4-kgzg8 1/1 Running 0 21m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 2/2 2 2 21m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 2 21m
In addition, you will see the new Flannel-related pod and daemonsets
kube-system pod/kube-flannel-ds-amd64-64jbv 1/1 Running 0 3m59s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-flannel-ds-amd64 1 1 1 1 1 beta.kubernetes.io/arch=amd64 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm 0 0 0 0 0 beta.kubernetes.io/arch=arm 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm64 0 0 0 0 0 beta.kubernetes.io/arch=arm64 3m59s
kube-system daemonset.apps/kube-flannel-ds-ppc64le 0 0 0 0 0 beta.kubernetes.io/arch=ppc64le 3m59s
kube-system daemonset.apps/kube-flannel-ds-s390x 0 0 0 0 0 beta.kubernetes.io/arch=s390x 3m59s
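Before rerunning the script it can help to confirm that the pods have actually reached Running. A sketch of such a check that parses kubectl get pods --no-headers output, where STATUS is the third column of a single-namespace listing; the helper name is made up, and this is not how bootstrap.sh itself waits:

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and succeed only if at least
# one pod is listed and every pod's STATUS is Running.
all_running() {
    awk 'NF >= 3 { n++; if ($3 != "Running") bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Example: poll the coredns pods (label k8s-app=kube-dns) every 5 seconds.
#   until kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | all_running; do
#       sleep 5
#   done
```

The check reads from stdin so it can be tried against saved output without a live cluster.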
10. Finally, it is time to continue running the script. It will try(!) to install helm and tiller and to restart dockerd. Everything goes fine, except TILLER...
$sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
11. We have no Tiller pod. As a result, the deployment and replicaset are broken too...
kube-system deployment.apps/tiller-deploy 0/1 0 0 7m26s
kube-system replicaset.apps/tiller-deploy-659c6788f5 1 0 0 7m26s
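I don't see any other solution here than to manually delete the related tiller components (deployment, service) and reinstall from scratch, with a few small workarounds.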
#delete tiller
$kubectl delete deployment tiller-deploy -n kube-system
$kubectl delete service tiller-deploy -n kube-system
#install helm,tiller
$curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$kubectl create serviceaccount --namespace kube-system tiller
$kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$helm init --service-account tiller
Now, if you check what has been deployed, you will clearly see that the tiller pod is in the Pending state, and likewise the tiller-deploy deployment is not ready
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 0/1 Pending 0 11m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 0/1 1 0 11m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 0 11m
12. Fixing Tiller
Let's describe the tiller pod and look at its tolerations
$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
I won't explain why (you can read up on taints and tolerations yourself), but the fix is to allow the master to run pods...
$kubectl taint nodes --all node-role.kubernetes.io/master-
After that you will see
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 1/1 Running 0 13m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 1/1 1 1 13m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 1 13m
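For readers who want the short version of the taint issue: kubeadm marks the control-plane node with the node-role.kubernetes.io/master:NoSchedule taint, and the Tiller pod's tolerations listed in this step do not cover it, so on a single-node cluster the pod has nowhere to be scheduled. A sketch for checking whether the taint is still present, reading kubectl describe nodes output; the helper name is illustrative:

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe nodes` output on stdin and
# succeed if any node still carries the master NoSchedule taint.
has_master_taint() {
    grep -qF 'node-role.kubernetes.io/master:NoSchedule'
}

# Example:
#   kubectl describe nodes | has_master_taint && echo "master is still tainted"
```

Removing the taint is reasonable for a single-node setup like this one; on a multi-node cluster it is usually left in place so that workloads stay off the control plane.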
13. Next, install all the components:
$curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara"successfully created
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
$ clara platform start
Starting clara...
NAME: clara
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console
14. To verify that the installation succeeded, run the following commands:
$ helm ls
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
clara 1 Mon Oct 19 16:16:36 2020 DEPLOYED clara-0.7.1-2008.1 1.0 default
clara-console 1 Mon Oct 19 16:28:30 2020 DEPLOYED clara-console-0.7.1-2008.1 1.0 default
clara-dicom-adapter 1 Mon Oct 19 16:22:36 2020 DEPLOYED dicom-adapter-0.7.1-2008.1 1.0 default
clara-monitor-server 1 Mon Oct 19 16:26:35 2020 DEPLOYED clara-monitor-server-0.7.1-2008.1 1.0 default
clara-render-server 1 Mon Oct 19 16:22:54 2020 DEPLOYED clara-renderer-0.7.1-2008.1 1.0 default
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6 1/1 Running 0 13m
clara-console-8565b4d565-wcbg5 2/2 Running 0 2m2s
clara-console-mongodb-85f8bd5f95-ts2gp 1/1 Running 0 2m2s
clara-dicom-adapter-7948fcd445-mnsjd 1/1 Running 0 7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq 1/1 Running 0 3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8 1/1 Running 0 3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq 1/1 Running 0 3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv 3/3 Running 0 7m38s
clara-resultsservice-664477898f-9nk4f 1/1 Running 0 13m
clara-ui-6f89b97df8-792f6 1/1 Running 0 13m
clara-workflow-controller-69cbb55fc8-zjhdm 1/1 Running 0 13m
elasticsearch-master-0 1/1 Running 0 3m57s
elasticsearch-master-1 1/1 Running 0 3m57s
fluentd-km8nj 1/1 Running 0 13m
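P.S. Of course, it would have been much easier to just fix the script for you, but I decided to show you what happens in the background. If you need to, I'm sure you can do that yourself.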