GKE NEG Ingress always returns 502 Bad Gateway

I have set up a StatefulSet, a Service with NEG, and an Ingress on a Google Cloud Kubernetes Engine cluster.

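Every workload and network object is ready and healthy. The Ingress has been created and the NEG status is updated for all services. The VPC-native (Alias-IP) and HTTP Load Balancer options are enabled for the cluster.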

But when I try to access my application using the path specified in the Ingress, I always get a 502 (Bad Gateway) error.

Here is my configuration (names, including the image name, have been redacted):

apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
  labels:
    app: myapp
  name: myapp
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: tcp
  selector:
    app: myapp
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  serviceName: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        livenessProbe:
          httpGet:
            path: /
            port: tcp
            scheme: HTTP
          initialDelaySeconds: 60
        image: myapp:8bebbaf
        ports:
        - containerPort: 1880
          name: tcp
          protocol: TCP
        readinessProbe:
          failureThreshold: 1
          httpGet:
            path: /
            port: tcp
            scheme: HTTP
        volumeMounts:
        - mountPath: /data
          name: data
      securityContext:
        fsGroup: 1000
      terminationGracePeriodSeconds: 10
  volumeClaimTemplates:
  - metadata:
      labels:
        app: myapp
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
  - http:
      paths:
      - path: /workflow
        backend:
          serviceName: myapp
          servicePort: 80

What is wrong with it, and how can I fix it?

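After a lot of digging and testing, I finally found the problem. As a side note, GKE NEG Ingress does not seem very stable (NEG is in fact still in beta) and does not always conform to the Kubernetes specs.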

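There is an issue with GKE Ingress related to named ports in the targetPort field. The fix is implemented and available from cluster version 1.16.0-gke.20 (Release), which as of today (February 2020) is available under the Rapid Channel. I have not tested the fix, because I ran into other issues with Ingress on a version from that channel.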
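
For illustration, these are the two fragments from the manifests above that the named port ties together. Under standard Kubernetes semantics, a string targetPort is resolved against the container port names of the selected pods, so `tcp` should resolve to 1880. That the affected GKE versions fail at exactly this lookup for NEG backends is an assumption based on the issue description, not something verified against the GKE source.

```yaml
# Service (excerpt): targetPort refers to a container port by name
ports:
- port: 80
  protocol: TCP
  targetPort: tcp      # expected to resolve to containerPort 1880
---
# StatefulSet container (excerpt): the port that carries that name
ports:
- containerPort: 1880
  name: tcp
  protocol: TCP
```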

If you are having the same issue, there are basically two options:

  1. Specify the exact port number instead of the port name in the Service's targetPort field. Here is the fixed Service config from my example:

    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        cloud.google.com/neg: '{"ingress": true}'
      labels:
        app: myapp
      name: myapp
    spec:
      ports:
      - port: 80
        protocol: TCP
        # Use the numeric container port (containerPort: 1880) instead of the port name:
        # targetPort: tcp
        targetPort: 1880
      selector:
        app: myapp
    
  2. Upgrade the GKE cluster to version 1.16.0-gke.20 or later (I have not tested this).