Terraform GKE node-pools spin up with reduced auth access scopes

Using Terraform, I spin up the following resources for my primary environment, using a service account dedicated to this cluster:

resource "google_container_cluster" "primary" {
  name     = var.gke_cluster_name
  location = var.region
  
  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {}
  networking_mode = "VPC_NATIVE"

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  cluster_autoscaling {
    enabled = true
    
    resource_limits {
      resource_type = "cpu"
      maximum = 40
      minimum = 3
    }

    resource_limits {
      resource_type = "memory"
      maximum = 100
      minimum = 12
    }
  }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "${google_container_cluster.primary.name}-node-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  node_count = var.gke_num_nodes

  node_config {

    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    labels = {
      env = var.project_id
    }
    disk_size_gb = 150
    preemptible  = true
    machine_type = var.machine_type
    tags         = ["gke-node", "${var.project_id}-gke"]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}
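
(For context, google_service_account.cluster_sa is referenced above but never shown in the post. A minimal declaration, together with the registry-pull grant mentioned just below, might look roughly like this; the account_id is hypothetical:)

resource "google_service_account" "cluster_sa" {
  # Hypothetical ID; the post does not show the actual value.
  account_id   = "cluster-sa"
  display_name = "GKE node service account"
}

resource "google_project_iam_member" "gcr_pull" {
  project = var.project_id
  role    = "roles/containerregistry.ServiceAgent"
  member  = "serviceAccount:${google_service_account.cluster_sa.email}"
}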

Even though I've given the nodes the appropriate permissions to pull from Google Container Registry (roles/containerregistry.ServiceAgent), I sometimes get a seemingly random ImagePullError from Kubernetes:

Unexpected status code [manifests latest]: 401 Unauthorized

After checking which service accounts are assigned to the node pools with the following command:

gcloud container clusters describe master-env --zone="europe-west2" | grep "serviceAccount"

I see the following output:

serviceAccount: default
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com

This indicates that although I specified the correct service account to assign to the nodes, for some reason (the primary pool, I believe) the default service account was assigned instead, and it uses the wrong OAuth scopes:

oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring

instead of https://www.googleapis.com/auth/cloud-platform.
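
(As a diagnostic aside, a sketch of one way to see which pool carries which identity is to tabulate the pools directly; the cluster name and zone here match the describe call above:)

gcloud container node-pools list --cluster=master-env --zone=europe-west2 \
    --format="table(name, config.serviceAccount, config.oauthScopes)"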

How can I ensure that all nodes use the same service account?

Edit 1:

After implementing the fix from @GariSingh, all of my node pools now use the same service account as intended, but when installing my services onto the cluster I still sometimes hit the unexpected status code [manifests latest]: 401 Unauthorized error.

This is odd, because other services installed on the cluster seem to pull their images from GCR without any problems.

Describing the pod shows the following events:

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/<my-deployment> to gke-master-env-nap-e2-standard-2-<id>
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "private-key" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "kube-api-access-5hh9r" : failed to sync configmap cache: timed out waiting for the condition
  Warning  Failed       9m34s (x5 over 10m)  kubelet            Error: ImagePullBackOff
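
(Note the nap-e2-standard-2 infix in the node name above: the pod was scheduled onto a pool created by node auto-provisioning rather than onto primary_nodes. One way to confirm where a failing pod landed:)

kubectl get pod <my-deployment> -o wide        # shows the node the pod was scheduled on
gcloud container node-pools list --cluster=master-env --zone=europe-west2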

Edit 2:

The final piece of the puzzle was adding oauth_scopes to auto_provisioning_defaults, mirroring the node config, so that the service account is used correctly.
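
In config terms, a sketch of what that change looks like (the full NAP example at the end of the answer below shows it in place):

  auto_provisioning_defaults {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }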

Not sure if you intended to use node auto-provisioning (NAP) or not (which I'd highly recommend unless it doesn't meet your needs), but the cluster_autoscaling argument of google_container_cluster actually enables NAP. It does not enable the cluster autoscaler for individual node pools.

If your goal is to enable cluster autoscaling for the node pool you create in your config and you don't want to use NAP, then you'll need to remove the cluster_autoscaling block, add an autoscaling block under your google_container_node_pool resource, and change node_count to initial_node_count:

resource "google_container_node_pool" "primary_nodes" {
  name       = "${google_container_cluster.primary.name}-node-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  initial_node_count = var.gke_num_nodes

  autoscaling {
    min_node_count = var.min_nodes
    max_node_count = var.max_nodes
  }

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    labels = {
      env = var.project_id
    }
    disk_size_gb = 150
    preemptible  = true
    machine_type = var.machine_type
    tags         = ["gke-node", "${var.project_id}-gke"]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

(The above assumes you have variables set for the minimum and maximum number of nodes.)
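
For reference, those variables could be declared along these lines (the defaults here are placeholders):

variable "min_nodes" {
  type    = number
  default = 1
}

variable "max_nodes" {
  type    = number
  default = 5
}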

If you'd like to use NAP, then you'll need to add an auto_provisioning_defaults block and configure its service_account property:

resource "google_container_cluster" "primary" {
  name     = var.gke_cluster_name
  location = var.region
  
  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {}
  networking_mode = "VPC_NATIVE"

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  cluster_autoscaling {
    enabled = true
    
    auto_provisioning_defaults {
      service_account = google_service_account.cluster_sa.email
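      # Per the question's Edit 2, set the scopes here as well, mirroring
      # node_config above; otherwise NAP-created nodes fall back to the
      # default logging/monitoring scopes.
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]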
    }      

    resource_limits {
      resource_type = "cpu"
      maximum = 40
      minimum = 3
    }

    resource_limits {
      resource_type = "memory"
      maximum = 100
      minimum = 12
    }
  }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
}