Terraform GKE node-pools spin up with reduced auth access scopes
Using Terraform, I spin up the following resources for my primary cluster, with a service account unique to this cluster:
resource "google_container_cluster" "primary" {
name = var.gke_cluster_name
location = var.region
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
ip_allocation_policy {}
networking_mode = "VPC_NATIVE"
node_config {
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
cluster_autoscaling {
enabled = true
resource_limits {
resource_type = "cpu"
maximum = 40
minimum = 3
}
resource_limits {
resource_type = "memory"
maximum = 100
minimum = 12
}
}
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
}
resource "google_container_node_pool" "primary_nodes" {
name = "${google_container_cluster.primary.name}-node-pool"
location = var.region
cluster = google_container_cluster.primary.name
node_count = var.gke_num_nodes
node_config {
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
labels = {
env = var.project_id
}
disk_size_gb = 150
preemptible = true
machine_type = var.machine_type
tags = ["gke-node", "${var.project_id}-gke"]
metadata = {
disable-legacy-endpoints = "true"
}
}
}
Even though I have given the nodes the appropriate permission to pull from Google Container Registry (roles/containerregistry.ServiceAgent), I sometimes randomly get an ImagePullError from Kubernetes:
Unexpected status code [manifests latest]: 401 Unauthorized
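For reference, a minimal sketch of how such a pull permission can be granted in Terraform (the resource name here is hypothetical; the role is an assumption based on the fact that GCR stores images in a Cloud Storage bucket, so roles/storage.objectViewer on the project is a common way to allow pulls):

resource "google_project_iam_member" "cluster_sa_gcr_pull" {
  # Hypothetical binding: lets the node service account read image layers
  # from the Cloud Storage bucket backing Google Container Registry.
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.cluster_sa.email}"
}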
After checking the service accounts assigned to the node pools with the following command:
gcloud container clusters describe master-env --zone="europe-west2" | grep "serviceAccount"
I see the following output:
serviceAccount: default
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
This indicates that although I specified the correct service account to assign to the nodes, for some reason (I believe it is the primary pool) the default service account was assigned instead, which uses the wrong OAuth scopes:
oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
instead of https://www.googleapis.com/auth/cloud-platform.
How do I ensure that all nodes use the same service account?
Edit 1:
After implementing the fix from @GariSingh, all my node pools now use the same service account as expected, but I still sometimes run into the unexpected status code [manifests latest]: 401 Unauthorized error when installing my services onto the cluster.
This is unusual, as other services installed on the cluster seem to pull their images from GCR without any issues.
Describing the pod shows the following events:
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/<my-deployment> to gke-master-env-nap-e2-standard-2-<id>
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "private-key" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "kube-api-access-5hh9r" : failed to sync configmap cache: timed out waiting for the condition
  Warning  Failed       9m34s (x5 over 10m)  kubelet            Error: ImagePullBackOff
Edit 2:
The final piece of the puzzle was to add oauth_scopes to auto_provisioning_defaults, mirroring the node config, so that the ServiceAccount is used correctly.
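For reference, the relevant block ended up looking roughly like this (a sketch; the resource_limits blocks stay exactly as in the answer below):

cluster_autoscaling {
  enabled = true
  auto_provisioning_defaults {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
  # resource_limits blocks unchanged
}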
Not sure if you intended to use Node auto-provisioning (NAP) (which I'd highly suggest unless it doesn't meet your needs), but the cluster_autoscaling argument of google_container_cluster actually enables this feature. It does not enable the cluster autoscaler for individual node pools.
If your goal is to enable cluster autoscaling for the node pool you create in your config, and not to use NAP, then you'll want to remove the cluster_autoscaling block, add an autoscaling block to your google_container_node_pool resource, and change node_count to initial_node_count:
resource "google_container_node_pool" "primary_nodes" {
name = "${google_container_cluster.primary.name}-node-pool"
location = var.region
cluster = google_container_cluster.primary.name
initial_node_count = var.gke_num_nodes
node_config {
autoscaling {
min_node_count = var.min_nodes
max_node_count = var.max_nodes
}
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
labels = {
env = var.project_id
}
disk_size_gb = 150
preemptible = true
machine_type = var.machine_type
tags = ["gke-node", "${var.project_id}-gke"]
metadata = {
disable-legacy-endpoints = "true"
}
}
}
(The above assumes you have variables set for the minimum and maximum number of nodes.)
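For completeness, a minimal sketch of those variables (the names match the snippet above; the defaults are assumptions to adjust to your needs):

variable "min_nodes" {
  description = "Minimum number of nodes per zone in the node pool"
  type        = number
  default     = 1
}

variable "max_nodes" {
  description = "Maximum number of nodes per zone in the node pool"
  type        = number
  default     = 5
}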
If you want to use NAP, then you'll need to add an auto_provisioning_defaults block and configure the service_account property:
resource "google_container_cluster" "primary" {
name = var.gke_cluster_name
location = var.region
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
ip_allocation_policy {}
networking_mode = "VPC_NATIVE"
node_config {
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
cluster_autoscaling {
enabled = true
auto_provisioning_defaults {
service_account = google_service_account.cluster_sa.email
}
resource_limits {
resource_type = "cpu"
maximum = 40
minimum = 3
}
resource_limits {
resource_type = "memory"
maximum = 100
minimum = 12
}
}
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
}
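Note that NAP-provisioned node pools take their service account and OAuth scopes from auto_provisioning_defaults, so (per Edit 2 above) you will likely also want to set oauth_scopes there; otherwise the auto-provisioned nodes fall back to the default scopes. Once applied, the gcloud container clusters describe check from the question should list the custom service account for every pool, including the NAP-created ones.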