Pods 之间 GKE 中的不良延迟
Bad latency in GKE between Pods
我们有一个非常奇怪的行为,在 kubernetes 集群 (GKE) 内通信有不可接受的大延迟。
对于具有 Memorystore get/store 操作和 CloudSQL 查询的端点,延迟在 600 毫秒和 1 秒之间跳跃。在开发环境中本地相同的设置 运行(尽管没有 k8s)没有显示这种延迟。
关于我们的架构:
我们是 运行 GKE 上的一个 k8s 集群,使用 terraform 和服务/部署 (yaml) 文件进行创建(我在下面添加了这些)。
我们 运行 两个节点 APIs (koa.js 2.5)。一个 API 暴露在 public 的入口,并通过节点端口连接到 API pod。
另一个 API pod 可通过内部负载均衡器从 google 私有访问。此 API 已连接到我们需要的所有资源(CloudSQL、Cloud Storage)。
两个 APIs 也连接到 Memorystore (Redis)。
pods 之间的通信使用自签名 server/client 证书进行保护(这不是问题,我们已经暂时将其删除以进行测试)。
我们检查了日志,发现从 public API 到私人请求只需要大约 200 毫秒就可以到达。
此外,私有 public API 的响应花费了大约 600 毫秒(从私有 API 的整个业务逻辑抛出到我们在公开 API)
我们真的没有什么可以尝试的了......我们已经将所有 Google 云资源连接到我们的本地环境,这并没有显示出那种糟糕的延迟。
在完整的本地设置中,延迟仅为我们在云设置中看到的延迟的 1/5 到 1/10。
我们还尝试从位于 0.100 毫秒区域的 public ping 私有 POD。
您有什么想法可以让我们进一步调查吗?
这是关于我们的 Google 云设置
的地形脚本
// Configure the Google Cloud provider
provider "google" {
project = "${var.project}"
region = "${var.region}"
}
data "google_compute_zones" "available" {}
# Ensuring relevant service APIs are enabled in your project. Alternatively visit and enable the needed services
resource "google_project_service" "serviceapi" {
service = "serviceusage.googleapis.com"
disable_on_destroy = false
}
resource "google_project_service" "sqlapi" {
service = "sqladmin.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
resource "google_project_service" "redisapi" {
service = "redis.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
# Create a VPC and a subnetwork in our region
resource "google_compute_network" "appnetwork" {
name = "${var.environment}-vpn"
auto_create_subnetworks = "false"
}
resource "google_compute_subnetwork" "network-with-private-secondary-ip-ranges" {
name = "${var.environment}-vpn-subnet"
ip_cidr_range = "10.2.0.0/16"
region = "europe-west1"
network = "${google_compute_network.appnetwork.self_link}"
secondary_ip_range {
range_name = "kubernetes-secondary-range-pods"
ip_cidr_range = "10.60.0.0/16"
}
secondary_ip_range {
range_name = "kubernetes-secondary-range-services"
ip_cidr_range = "10.70.0.0/16"
}
}
# GKE cluster setup
resource "google_container_cluster" "primary" {
name = "${var.environment}-cluster"
zone = "${data.google_compute_zones.available.names[1]}"
initial_node_count = 1
description = "Kubernetes Cluster"
network = "${google_compute_network.appnetwork.self_link}"
subnetwork = "${google_compute_subnetwork.network-with-private-secondary-ip-ranges.self_link}"
depends_on = ["google_project_service.serviceapi"]
additional_zones = [
"${data.google_compute_zones.available.names[0]}",
"${data.google_compute_zones.available.names[2]}",
]
master_auth {
username = "xxxxxxx"
password = "xxxxxxx"
}
ip_allocation_policy {
cluster_secondary_range_name = "kubernetes-secondary-range-pods"
services_secondary_range_name = "kubernetes-secondary-range-services"
}
node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/trace.append"
]
tags = ["kubernetes", "${var.environment}"]
}
}
##################
# MySQL DATABASES
##################
resource "google_sql_database_instance" "core" {
name = "${var.environment}-sql-core"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant1" {
name = "${var.environment}-sql-tenant1"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant2" {
name = "${var.environment}-sql-tenant2"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database" "core" {
name = "project_core"
instance = "${google_sql_database_instance.core.name}"
}
resource "google_sql_database" "tenant1" {
name = "project_tenant_1"
instance = "${google_sql_database_instance.tenant1.name}"
}
resource "google_sql_database" "tenant2" {
name = "project_tenant_2"
instance = "${google_sql_database_instance.tenant2.name}"
}
##################
# MySQL USERS
##################
resource "google_sql_user" "core-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.core.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant1-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant1.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant2-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant2.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
##################
# REDIS
##################
resource "google_redis_instance" "redis" {
name = "${var.environment}-redis"
tier = "BASIC"
memory_size_gb = 1
depends_on = ["google_project_service.redisapi"]
authorized_network = "${google_compute_network.appnetwork.self_link}"
region = "${var.region}"
location_id = "${data.google_compute_zones.available.names[0]}"
redis_version = "REDIS_3_2"
display_name = "Redis Instance"
}
# The following outputs allow authentication and connectivity to the GKE Cluster.
output "client_certificate" {
value = "${google_container_cluster.primary.master_auth.0.client_certificate}"
}
output "client_key" {
value = "${google_container_cluster.primary.master_auth.0.client_key}"
}
output "cluster_ca_certificate" {
value = "${google_container_cluster.primary.master_auth.0.cluster_ca_certificate}"
}
私有的服务和部署API
# START CRUD POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: crud-pod
labels:
app: crud
spec:
template:
metadata:
labels:
app: crud
spec:
containers:
- name: crud
image: eu.gcr.io/dev-xxxxx/crud:latest-unstable
ports:
- containerPort: 3333
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
# [START proxy_container]
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.11
command: ["/cloud_sql_proxy",
"-instances=dev-xxxx:europe-west1:dev-sql-core=tcp:3306,dev-xxxx:europe-west1:dev-sql-tenant1=tcp:3307,dev-xxxx:europe-west1:dev-sql-tenant2=tcp:3308",
"-credential_file=xxxx"]
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
# [END proxy_container]
# [START volumes]
volumes:
- name: cloudsql-instance-credentials
secret:
secretName: cloudsql-instance-credentials
- [..ConfigFilesVolumes..]
# [END volumes]
# END CRUD POD
-------
# START CRUD SERVICE
apiVersion: v1
kind: Service
metadata:
name: crud
annotations:
cloud.google.com/load-balancer-type: "Internal"
spec:
type: LoadBalancer
loadBalancerSourceRanges:
- 10.60.0.0/16
ports:
- name: crud-port
port: 3333
protocol: TCP # default; can also specify UDP
selector:
app: crud # label selector for Pods to target
# END CRUD SERVICE
还有 public 个(包括入口)
# START SAPI POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: sapi-pod
labels:
app: sapi
spec:
template:
metadata:
labels:
app: sapi
spec:
containers:
- name: sapi
image: eu.gcr.io/dev-xxx/sapi:latest-unstable
ports:
- containerPort: 8080
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
volumes:
- [..ConfigFilesVolumes..]
# END SAPI POD
-------------
# START SAPI SERVICE
kind: Service
apiVersion: v1
metadata:
name: sapi # Service name
spec:
selector:
app: sapi
ports:
- port: 8080
targetPort: 8080
type: NodePort
# END SAPI SERVICE
--------------
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: dev-ingress
annotations:
kubernetes.io/ingress.global-static-ip-name: api-dev-static-ip
labels:
app: sapi-ingress
spec:
backend:
serviceName: sapi
servicePort: 8080
tls:
- hosts:
- xxxxx
secretName: xxxxx
我们通过从 logTransport 中删除 @google-cloud/logging-winston 解决了这个问题。
出于某种原因,它阻塞了我们的流量,导致我们的延迟非常糟糕。
我们有一个非常奇怪的行为,在 kubernetes 集群 (GKE) 内通信有不可接受的大延迟。 对于具有 Memorystore get/store 操作和 CloudSQL 查询的端点,延迟在 600 毫秒和 1 秒之间跳跃。在开发环境中本地相同的设置 运行(尽管没有 k8s)没有显示这种延迟。
关于我们的架构: 我们是 运行 GKE 上的一个 k8s 集群,使用 terraform 和服务/部署 (yaml) 文件进行创建(我在下面添加了这些)。 我们 运行 两个节点 APIs (koa.js 2.5)。一个 API 暴露在 public 的入口,并通过节点端口连接到 API pod。
另一个 API pod 可通过内部负载均衡器从 google 私有访问。此 API 已连接到我们需要的所有资源(CloudSQL、Cloud Storage)。
两个 APIs 也连接到 Memorystore (Redis)。
pods 之间的通信使用自签名 server/client 证书进行保护(这不是问题,我们已经暂时将其删除以进行测试)。
我们检查了日志,发现从 public API 到私人请求只需要大约 200 毫秒就可以到达。 此外,私有 public API 的响应花费了大约 600 毫秒(从私有 API 的整个业务逻辑抛出到我们在公开 API)
我们真的没有什么可以尝试的了......我们已经将所有 Google 云资源连接到我们的本地环境,这并没有显示出那种糟糕的延迟。
在完整的本地设置中,延迟仅为我们在云设置中看到的延迟的 1/5 到 1/10。 我们还尝试从位于 0.100 毫秒区域的 public ping 私有 POD。
您有什么想法可以让我们进一步调查吗? 这是关于我们的 Google 云设置
的地形脚本 // Configure the Google Cloud provider
provider "google" {
project = "${var.project}"
region = "${var.region}"
}
data "google_compute_zones" "available" {}
# Ensuring relevant service APIs are enabled in your project. Alternatively visit and enable the needed services
resource "google_project_service" "serviceapi" {
service = "serviceusage.googleapis.com"
disable_on_destroy = false
}
resource "google_project_service" "sqlapi" {
service = "sqladmin.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
resource "google_project_service" "redisapi" {
service = "redis.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
# Create a VPC and a subnetwork in our region
resource "google_compute_network" "appnetwork" {
name = "${var.environment}-vpn"
auto_create_subnetworks = "false"
}
resource "google_compute_subnetwork" "network-with-private-secondary-ip-ranges" {
name = "${var.environment}-vpn-subnet"
ip_cidr_range = "10.2.0.0/16"
region = "europe-west1"
network = "${google_compute_network.appnetwork.self_link}"
secondary_ip_range {
range_name = "kubernetes-secondary-range-pods"
ip_cidr_range = "10.60.0.0/16"
}
secondary_ip_range {
range_name = "kubernetes-secondary-range-services"
ip_cidr_range = "10.70.0.0/16"
}
}
# GKE cluster setup
resource "google_container_cluster" "primary" {
name = "${var.environment}-cluster"
zone = "${data.google_compute_zones.available.names[1]}"
initial_node_count = 1
description = "Kubernetes Cluster"
network = "${google_compute_network.appnetwork.self_link}"
subnetwork = "${google_compute_subnetwork.network-with-private-secondary-ip-ranges.self_link}"
depends_on = ["google_project_service.serviceapi"]
additional_zones = [
"${data.google_compute_zones.available.names[0]}",
"${data.google_compute_zones.available.names[2]}",
]
master_auth {
username = "xxxxxxx"
password = "xxxxxxx"
}
ip_allocation_policy {
cluster_secondary_range_name = "kubernetes-secondary-range-pods"
services_secondary_range_name = "kubernetes-secondary-range-services"
}
node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/trace.append"
]
tags = ["kubernetes", "${var.environment}"]
}
}
##################
# MySQL DATABASES
##################
resource "google_sql_database_instance" "core" {
name = "${var.environment}-sql-core"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant1" {
name = "${var.environment}-sql-tenant1"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant2" {
name = "${var.environment}-sql-tenant2"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database" "core" {
name = "project_core"
instance = "${google_sql_database_instance.core.name}"
}
resource "google_sql_database" "tenant1" {
name = "project_tenant_1"
instance = "${google_sql_database_instance.tenant1.name}"
}
resource "google_sql_database" "tenant2" {
name = "project_tenant_2"
instance = "${google_sql_database_instance.tenant2.name}"
}
##################
# MySQL USERS
##################
resource "google_sql_user" "core-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.core.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant1-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant1.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant2-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant2.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
##################
# REDIS
##################
resource "google_redis_instance" "redis" {
name = "${var.environment}-redis"
tier = "BASIC"
memory_size_gb = 1
depends_on = ["google_project_service.redisapi"]
authorized_network = "${google_compute_network.appnetwork.self_link}"
region = "${var.region}"
location_id = "${data.google_compute_zones.available.names[0]}"
redis_version = "REDIS_3_2"
display_name = "Redis Instance"
}
# The following outputs allow authentication and connectivity to the GKE Cluster.
output "client_certificate" {
value = "${google_container_cluster.primary.master_auth.0.client_certificate}"
}
output "client_key" {
value = "${google_container_cluster.primary.master_auth.0.client_key}"
}
output "cluster_ca_certificate" {
value = "${google_container_cluster.primary.master_auth.0.cluster_ca_certificate}"
}
私有的服务和部署API
# START CRUD POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: crud-pod
labels:
app: crud
spec:
template:
metadata:
labels:
app: crud
spec:
containers:
- name: crud
image: eu.gcr.io/dev-xxxxx/crud:latest-unstable
ports:
- containerPort: 3333
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
# [START proxy_container]
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.11
command: ["/cloud_sql_proxy",
"-instances=dev-xxxx:europe-west1:dev-sql-core=tcp:3306,dev-xxxx:europe-west1:dev-sql-tenant1=tcp:3307,dev-xxxx:europe-west1:dev-sql-tenant2=tcp:3308",
"-credential_file=xxxx"]
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
# [END proxy_container]
# [START volumes]
volumes:
- name: cloudsql-instance-credentials
secret:
secretName: cloudsql-instance-credentials
- [..ConfigFilesVolumes..]
# [END volumes]
# END CRUD POD
-------
# START CRUD SERVICE
apiVersion: v1
kind: Service
metadata:
name: crud
annotations:
cloud.google.com/load-balancer-type: "Internal"
spec:
type: LoadBalancer
loadBalancerSourceRanges:
- 10.60.0.0/16
ports:
- name: crud-port
port: 3333
protocol: TCP # default; can also specify UDP
selector:
app: crud # label selector for Pods to target
# END CRUD SERVICE
还有 public 个(包括入口)
# START SAPI POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: sapi-pod
labels:
app: sapi
spec:
template:
metadata:
labels:
app: sapi
spec:
containers:
- name: sapi
image: eu.gcr.io/dev-xxx/sapi:latest-unstable
ports:
- containerPort: 8080
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
volumes:
- [..ConfigFilesVolumes..]
# END SAPI POD
-------------
# START SAPI SERVICE
kind: Service
apiVersion: v1
metadata:
name: sapi # Service name
spec:
selector:
app: sapi
ports:
- port: 8080
targetPort: 8080
type: NodePort
# END SAPI SERVICE
--------------
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: dev-ingress
annotations:
kubernetes.io/ingress.global-static-ip-name: api-dev-static-ip
labels:
app: sapi-ingress
spec:
backend:
serviceName: sapi
servicePort: 8080
tls:
- hosts:
- xxxxx
secretName: xxxxx
我们通过从 logTransport 中删除 @google-cloud/logging-winston 解决了这个问题。 出于某种原因,它阻塞了我们的流量,导致我们的延迟非常糟糕。