Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0" while installing Velero in GKE Cluster

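I am trying to install and configure Velero for Kubernetes backups. I followed this link to configure it in my GKE cluster. The installation went fine, but Velero is not working.

I am using Google Cloud Shell to run all my commands (I have already installed and configured the Velero client in my Google Cloud Shell).

On further inspection of the Velero deployment and the Velero pods, I found that it is not able to pull the image from the Docker repository.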

kubectl get pods -n velero
NAME                      READY   STATUS              RESTARTS   AGE
velero-5489b955f6-kqb7z   0/1     Init:ErrImagePull   0          20s

Error from the Velero pod (kubectl describe pod) (output redacted for readability; only the relevant information is shown below):

Events:
  Type     Reason     Age               From                                                  Message
  ----     ------     ----              ----                                                  -------
  Normal   Scheduled  38s               default-scheduler                                     Successfully assigned velero/velero-5489b955f6-kqb7z to gke-gke-cluster1-default-pool-a354fba3-8674
  Warning  Failed     22s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Failed     22s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Error: ErrImagePull
  Normal   BackOff    21s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Back-off pulling image "velero/velero-plugin-for-gcp:v1.1.0"
  Warning  Failed     21s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Error: ImagePullBackOff
  Normal   Pulling    8s (x2 over 37s)  kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Pulling image "velero/velero-plugin-for-gcp:v1.1.0"

Command used to install Velero (some values are given as variables):

velero install \
     --provider gcp \
     --plugins velero/velero-plugin-for-gcp:v1.1.0 \
     --bucket $storagebucket \
     --secret-file ~/velero-backup-storage-sa-key.json

Velero version

velero version
Client:
        Version: v1.4.2
        Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
<error getting server version: timed out waiting for server status request to be processed>

GKE version

v1.15.12-gke.2

Isn't this a Private Cluster ? – mario 31 mins ago

@mario this is a private cluster but I can deploy other services without any issues (for eg: I have deployed nginx successfully) – Sreesan 15 mins ago

Well, this is a known limitation of GKE Private Clusters. As you can read in the documentation:

Can't pull image from public Docker Hub

Symptoms

A Pod running in your cluster displays a warning in kubectl describe such as Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Potential causes

Nodes in a private cluster do not have outbound access to the public internet. They have limited access to Google APIs and services, including Container Registry.

Resolution

You cannot fetch images directly from Docker Hub. Instead, use images hosted on Container Registry. Note that while Container Registry's Docker Hub mirror is accessible from a private cluster, it should not be exclusively relied upon. The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub.

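You can also compare it with this answer.

You can easily verify this with a simple experiment. Try to run two different nginx deployments: the first based on the image nginx (which is equivalent to nginx:latest) and the second based on nginx:1.14.2.

The first case works fine, because the nginx:latest image can be pulled from Container Registry's Docker Hub mirror, which is accessible from private clusters. Any attempt to pull nginx:1.14.2, however, will fail, as you will see in the Pod events. This happens because the kubelet cannot find this version of the image in GCR, so it tries to pull it from the public Docker registry (https://registry-1.docker.io/v2/), which is not possible in Private Clusters. "The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub." - as you can read in the documentation.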
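The experiment described above could be sketched as the shell helper below. The function name and the deployment names `nginx-latest` and `nginx-pinned` are illustrative assumptions, not part of the original setup; only `kubectl create deployment` and `kubectl get pods` are used.

```shell
#!/usr/bin/env bash
# Sketch only: the function and deployment names are hypothetical.
# Creates two nginx deployments that differ only in the image tag,
# then lists their pods so the image pull status can be compared.
run_nginx_experiment() {
  # Untagged image, equivalent to nginx:latest
  kubectl create deployment nginx-latest --image=nginx
  # Pinned tag
  kubectl create deployment nginx-pinned --image=nginx:1.14.2
  # kubectl create deployment labels the pods with app=<deployment name>
  kubectl get pods -l 'app in (nginx-latest, nginx-pinned)'
}
```

After calling `run_nginx_experiment`, running `kubectl describe pod` on each pod shows the pull events to compare. Which tags succeed depends on what the mirror happens to cache at that moment.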

If you still have doubts, just ssh into your node and try to run the following commands:

curl https://cloud.google.com/container-registry/

curl https://registry-1.docker.io/v2/

While the first one works perfectly, the second one will eventually fail:

curl: (7) Failed to connect to registry-1.docker.io port 443: Connection timed out

The reason? - "Nodes in a private cluster do not have outbound access to the public internet."
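The two curl checks above can be wrapped in a small helper so that a blocked connection fails quickly instead of waiting for the default timeout. This is a minimal sketch: `check_egress` is a hypothetical name, and the 10-second limit is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Sketch only: check_egress is a hypothetical helper name.
# Tries each URL with a short timeout and reports whether a connection
# could be made. HTTP error statuses (such as 401) still count as
# reachable, because only the connection itself is being tested.
check_egress() {
  local url
  for url in "$@"; do
    # --max-time bounds the whole request so a blocked connection fails fast
    if curl --silent --output /dev/null --max-time 10 "$url"; then
      echo "reachable: $url"
    else
      # Inside the else branch, $? still holds curl's exit status
      echo "unreachable: $url (curl exit code $?)"
    fi
  done
}
```

On a node it could be called as `check_egress https://cloud.google.com/container-registry/ https://registry-1.docker.io/v2/`.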

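The solution?

You can search GCR here to see what is currently available.

In many cases you should be able to get the required image if you don't specify an exact version (the latest tag is used by default). While this can help with nginx, unfortunately no version of velero/velero-plugin-for-gcp is currently available in Google Container Registry's Docker Hub mirror.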

Granting private nodes outbound internet access by using Cloud NAT seems to be the only reasonable solution that can be applied in your case.
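As a rough sketch of what the Cloud NAT setup might look like with gcloud: the router name `nat-router`, the NAT name `nat-config`, and the function wrapper are assumptions made for illustration, and the network and region must match those of the private cluster.

```shell
#!/usr/bin/env bash
# Sketch only: nat-router and nat-config are hypothetical names.
# Usage: setup_cloud_nat <vpc-network> <region>
setup_cloud_nat() {
  local network="$1" region="$2"
  # A Cloud Router in the cluster's VPC network and region hosts the NAT config
  gcloud compute routers create nat-router \
    --network "$network" \
    --region "$region" || return 1
  # NAT gateway covering all subnet ranges, with auto-allocated external IPs
  gcloud compute routers nats create nat-config \
    --router nat-router \
    --region "$region" \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
}
```

Once the NAT gateway is in place, the nodes should be able to reach registry-1.docker.io, and the kubelet's next retry of the image pull would be expected to succeed.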