Kubernetes/EKS rolling update causes downtime

Question

We have the following configuration for a service deployed to EKS, but every time we deploy, it causes roughly 120 seconds of downtime.

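When I port-forward directly to a new pod, I can make requests to it successfully, so the pod itself seems fine. It looks like either the AWS NLB is not routing traffic or something network-related is wrong, but I'm not sure, and I don't know where to debug further.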

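I have tried a few things, to no avail: I added a readinessProbe, tried increasing initialDelaySeconds to 120, tried switching to the IP ELB target type instead of the instance ELB target type, and tried reducing the NLB health check interval, but that setting was not actually applied and the interval stayed at 30 seconds.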

Any help would be greatly appreciated!

---
# Autoscaler for the frontend

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-frontend
  minReplicas: 3
  maxReplicas: 8
  targetCPUUtilizationPercentage: 60

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-frontend
  labels:
    app: my-frontend
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: my-frontend
  template:
    metadata:
      labels:
        app: my-frontend
    spec:
      containers:
        - name: my-frontend
          image: ${DOCKER_IMAGE}
          ports:
            - containerPort: 3001
              name: web
          resources:
            requests:
              cpu: "300m"
              memory: "256Mi"
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /v1/ping
              port: 3001
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 10
          readinessProbe:
            httpGet:
              scheme: HTTP
              path: /v1/ping
              port: 3001
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 10
      restartPolicy: Always

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: ${SSL_CERTIFICATE_ARN}
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
  name: my-frontend
  labels:
    service: my-frontend
spec:
  ports:
    - name: http
      port: 80
      targetPort: 3001
    - name: https
      port: 443
      targetPort: 3001
  externalTrafficPolicy: Local
  selector:
    app: my-frontend
  type: LoadBalancer

Answer 1

This is most likely caused by the NLB not reacting quickly enough to target changes, which is directly related to your externalTrafficPolicy setting.

If your application does not use the client IP, you can set externalTrafficPolicy to Cluster, or remove the field so that it falls back to the default (which is Cluster).
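A minimal sketch of that change, assuming the Service from the question. The field accepts the values `Cluster` and `Local`, and `Cluster` is the default. Only `externalTrafficPolicy` differs from the original manifest; the remaining annotations from the question are omitted here for brevity and would be kept in practice.

```yaml
# Sketch only: the Service from the question with externalTrafficPolicy changed.
# Other annotations (SSL cert, health check interval, etc.) are left out for brevity.
apiVersion: v1
kind: Service
metadata:
  name: my-frontend
  labels:
    service: my-frontend
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  # Cluster (the default): every node accepts traffic on the node port and
  # forwards it to a ready pod, so a node stays a healthy NLB target even when
  # no pod of this Service runs on it. The trade-off is that the client source
  # IP is not preserved.
  externalTrafficPolicy: Cluster
  selector:
    app: my-frontend
  ports:
    - name: http
      port: 80
      targetPort: 3001
    - name: https
      port: 443
      targetPort: 3001
```

With `Local`, only nodes that currently host a ready pod pass the load balancer health check, so when a rolling update moves pods between nodes the NLB has to notice the change before traffic flows again. That is the delay this setting avoids.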

If your application needs to preserve the client IP, you can use the solution discussed in this GitHub issue, which in short requires you to use blue-green deployment.
