How to configure GKE Autopilot w/Envoy & gRPC-Web

I have an application running on my local machine that uses React -> gRPC-Web -> Envoy -> a Go application, and everything works without a problem. I'm trying to deploy it with GKE Autopilot, but I haven't been able to get the configuration right. I'm new to all things GCP/GKE, so I'm looking for help figuring out where I've gone wrong.

I was originally following this document, although I only have a single gRPC service: https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy

As I understand it, GKE Autopilot mode requires external HTTP(S) load balancing rather than the network load balancing described in that solution, so that is what I've been trying to get working. After various attempts, my current approach uses an Ingress, a BackendConfig, a Service, and a Deployment. The Deployment contains three containers: my application, an Envoy sidecar to translate the gRPC-Web requests and responses, and a Cloud SQL Proxy sidecar. I eventually want to use TLS, but for now I'm leaving it out so as not to complicate things further.

When I apply all of the configuration, the backend service shows one backend in one zone and the health check fails. The health check is set up for port 8080 and path /healthz, which is what I believe I specified in the Deployment, but I have my doubts because when I look at the details of the envoy-sidecar container it shows the readiness probe as: http-get HTTP://:0/healthz headers=x-envoy-livenessprobe:healthz. Does the ":0" just mean it's using the container's default address and port, or does it indicate a configuration problem?
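
For reference, the readiness probe in my Deployment below references the named container port http; the same probe with the numeric port written out explicitly (just a sketch, in case the named-port indirection is what produces that ":0") would be:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080            # numeric container port instead of the named port "http"
            scheme: HTTP
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz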

I've been reading through various docs but haven't been able to piece them all together. Is there an example somewhere of how to do this? I've been searching but haven't found one.

My current configuration is:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grammar-games-ingress
  #annotations:
    # If the class annotation is not specified it defaults to "gce".
    # kubernetes.io/ingress.class: "gce"
    # kubernetes.io/ingress.global-static-ip-name: <IP addr>
spec:
  defaultBackend:
    service:
      name: grammar-games-core
      port:
        number: 80
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: grammar-games-bec
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"  
  healthCheck:
    checkIntervalSec: 15
    port: 8080
    type: HTTP
    requestPath: /healthz
  timeoutSec: 60
---
apiVersion: v1
kind: Service
metadata:
  name: grammar-games-core
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/app-protocols: '{"http":"HTTP"}'
    cloud.google.com/backend-config: '{"default": "grammar-games-bec"}'
spec:
  type: ClusterIP
  selector:
    app: grammar-games-core
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grammar-games-core
spec:
  # Two replicas for right now, just so I can see how RPC calls get directed.
  # replicas: 2
  selector:
    matchLabels:
      app: grammar-games-core
  template:
    metadata:
      labels:
        app: grammar-games-core
    spec:
      serviceAccountName: grammar-games-core-k8sa
      containers:
      - name: grammar-games-core
        image: gcr.io/grammar-games/grammar-games-core:1.1.2
        command:
          - "/bin/grammar-games-core"
        ports:
        - containerPort: 52001
        env:
        - name: GAMESDB_USER
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: username
        - name: GAMESDB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: password
        - name: GAMESDB_DB_NAME
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: db-name 
        - name: GRPC_SERVER_PORT
          value: '52001'
        - name: GAMES_LOG_FILE_PATH
          value: ''
        - name: GAMESDB_LOG_LEVEL
          value: 'debug'
        resources:
          requests:
            # Resource requests for the application container.
            # Adjust these values based on your application's requirements.
            memory: "2Gi"
            cpu:    "1"
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:52001"]
          initialDelaySeconds: 5
      - name: cloud-sql-proxy
        # It is recommended to use the latest version of the Cloud SQL proxy
        # Make sure to update on a regular schedule!
        image: gcr.io/cloudsql-docker/gce-proxy:1.24.0
        command:
          - "/cloud_sql_proxy"

          # If connecting from a VPC-native GKE cluster, you can use the
          # following flag to have the proxy connect over private IP
          # - "-ip_address_types=PRIVATE"

          # Replace DB_PORT with the port the proxy should listen on
          # Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
          - "-instances=grammar-games:us-east1:grammar-games-db=tcp:3306"
        securityContext:
          # The default Cloud SQL proxy image runs as the
          # "nonroot" user and group (uid: 65532) by default.
          runAsNonRoot: true
        # Resource configuration depends on an application's requirements. You
        # should adjust the following values based on what your application
        # needs. For details, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
        resources:
          requests:
            # The proxy's memory use scales linearly with the number of active
            # connections. Fewer open connections will use less memory. Adjust
            # this value based on your application's requirements.
            memory: "2Gi"
            # The proxy's CPU use scales linearly with the amount of IO between
            # the database and the application. Adjust this value based on your
            # application's requirements.
            cpu:    "1"
      - name: envoy-sidecar
        image: envoyproxy/envoy:v1.20-latest
        ports:
        - name: http
          containerPort: 8080
        resources:
          requests:
            cpu: 10m
            ephemeral-storage: 256Mi
            memory: 256Mi
        volumeMounts:
        - name: config
          mountPath: /etc/envoy
        readinessProbe:
          httpGet:
            port: http
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            scheme: HTTP
      volumes:
      - name: config
        configMap:
          name: envoy-sidecar-conf      
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-sidecar-conf
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              access_log:
              - name: envoy.access_loggers.stdout
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              codec_type: AUTO
              stat_prefix: ingress_http
              route_config:
                name: local_route
                virtual_hosts:
                - name: http
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/grammar_games_protos.GrammarGames/"
                    route:
                      cluster: grammar-games-core-grpc
                  cors:
                    allow_origin_string_match:
                    - prefix: "*"
                    allow_methods: GET, PUT, DELETE, POST, OPTIONS
                    allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                    max_age: "1728000"
                    expose_headers: custom-header-1,grpc-status,grpc-message
              http_filters:
              - name: envoy.filters.http.health_check
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                  pass_through_mode: false
                  headers:
                  - name: ":path"
                    exact_match: "/healthz"
                  - name: "x-envoy-livenessprobe"
                    exact_match: "healthz"
              - name: envoy.filters.http.grpc_web
              - name: envoy.filters.http.cors
              - name: envoy.filters.http.router
                typed_config: {}
      clusters:
      - name: grammar-games-core-grpc
        connect_timeout: 0.5s
        type: logical_dns
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        load_assignment:
          cluster_name: grammar-games-core-grpc
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 0.0.0.0
                    port_value: 52001
        health_checks:
        - timeout: 1s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          grpc_health_check: {}
    admin:
      access_log_path: /dev/stdout
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 8090

Here is some documentation on Setting up HTTP(S) Load Balancing with Ingress. That tutorial shows how to run a web application behind an external HTTP(S) load balancer by configuring an Ingress resource.

Regarding creating an HTTP load balancer on GKE with Ingress, I found two threads in which the instances that were created got marked as unhealthy.

In the first one, they mention needing to manually add a firewall rule that allows the HTTP load balancer's health-check IP ranges (130.211.0.0/22 and 35.191.0.0/16) to reach the backends.

In the second one, they mention that the Pod spec must also include the containerPort. Example:

spec:
  containers:
  - name: nginx
    image: nginx:1.7.9
    ports:
    - containerPort: 80

Beyond those, there is also some additional documentation covering related topics.

I finally solved this, so I wanted to post my answer for reference.

It turns out that the solution in this document works:

https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy#introduction

From one of the docs about GKE Autopilot mode I was under the impression that you can't use a network load balancer and instead need to use an Ingress with HTTP(S) load balancing. That's why I went with the other approach, but even after several weeks working with Google support, the configuration all looked correct and yet the load balancer health checks still wouldn't pass. That's when we discovered that this solution with a network load balancer does in fact work.

I also had some trouble configuring https/TLS. That turned out to be a problem in my Envoy configuration file.
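
For reference, the envoy-certs Secret that the Envoy Deployment below mounts at /etc/ssl/envoy is just a standard TLS Secret (any Secret with tls.crt and tls.key keys works); a minimal sketch with placeholder values, substituting your own base64-encoded PEM certificate chain and key, would be:

apiVersion: v1
kind: Secret
metadata:
  name: envoy-certs
type: kubernetes.io/tls
data:
  # Placeholders: base64-encoded PEM certificate chain and private key.
  tls.crt: <base64-encoded certificate chain>
  tls.key: <base64-encoded private key>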

I also have an issue with pod stability, but that's a separate problem that I'll work through in another post/question. As long as I only request 1 replica, the solution is stable and works well, and Autopilot should scale the pods as needed.

I know the configuration for all of this is quite tricky, so I'm including all of it here for reference (with my-app as a placeholder). Hopefully it helps someone else get there faster than I did! I think it's a good solution once gRPC-Web is working. You'll also notice that I'm using the cloud-sql-proxy sidecar to talk to the Cloud SQL database, and that I'm using a GKE service account for authentication.
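
One note on the service account: the Deployment below sets serviceAccountName: my-app-k8sa. Assuming Workload Identity (which Autopilot clusters have enabled), that Kubernetes ServiceAccount is annotated with the Google service account it impersonates; a sketch with hypothetical names:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-k8sa
  annotations:
    # Hypothetical Google service account; it needs the Cloud SQL Client role for the proxy.
    iam.gke.io/gcp-service-account: my-app-gsa@my-project.iam.gserviceaccount.com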

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app-k8sa
      terminationGracePeriodSeconds: 30
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:1.1.0
        command:
          - "/bin/my-app"
        ports:
        - containerPort: 52001
        env:
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: db-config
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-config
              key: password
        - name: DB_NAME
          valueFrom:
            secretKeyRef:
              name: db-config
              key: db-name 
        - name: GRPC_SERVER_PORT
          value: '52001'
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:52001"]
          initialDelaySeconds: 10
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:52001"]
          initialDelaySeconds: 15
      - name: cloud-sql-proxy
        # It is recommended to use the latest version of the Cloud SQL proxy
        # Make sure to update on a regular schedule!
        image: gcr.io/cloudsql-docker/gce-proxy:1.27.1
        command:
          - "/cloud_sql_proxy"

          # If connecting from a VPC-native GKE cluster, you can use the
          # following flag to have the proxy connect over private IP
          # - "-ip_address_types=PRIVATE"

          # Replace DB_PORT with the port the proxy should listen on
          # Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
          - "-instances=my-project:us-east1:my-app-db=tcp:3306"
        securityContext:
          # The default Cloud SQL proxy image runs as the
          # "nonroot" user and group (uid: 65532) by default.
          runAsNonRoot: true

---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - name: my-app-port
    protocol: TCP
    port: 52001
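  # Headless Service (clusterIP: None) so DNS for my-app returns the individual pod IPs,
  # which the Envoy STRICT_DNS cluster below resolves and load-balances across.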
  clusterIP: None
---
apiVersion: v1
kind: Service
metadata:
  name: envoy
spec:
  type: LoadBalancer
  selector:
    app: envoy
  ports:
  - name: https
    protocol: TCP
    port: 443
    targetPort: 8443
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy
  template:
    metadata:
      labels:
        app: envoy
    spec:
      containers:
      - name: envoy
        image: envoyproxy/envoy:v1.20-latest
        ports:
        - name: https
          containerPort: 8443
        resources:
          requests:
            cpu: 10m
            ephemeral-storage: 256Mi
            memory: 256Mi
        volumeMounts:
        - name: config
          mountPath: /etc/envoy
        - name: certs
          mountPath: /etc/ssl/envoy
        readinessProbe:
          httpGet:
            port: https
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            scheme: HTTPS
      volumes:
      - name: config
        configMap:
          name: envoy-conf
      - name: certs
        secret:
          secretName: envoy-certs
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-conf
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 8443
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              access_log:
              - name: envoy.access_loggers.stdout
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              codec_type: AUTO
              stat_prefix: ingress_https
              route_config:
                name: local_route
                virtual_hosts:
                - name: https
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/my_app_protos.MyService/"
                    route:
                      cluster: my-app-cluster
                  cors:
                    allow_origin_string_match:
                    - prefix: "*"
                    allow_methods: GET, PUT, DELETE, POST, OPTIONS
                    allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                    max_age: "1728000"
                    expose_headers: custom-header-1,grpc-status,grpc-message
              http_filters:
              - name: envoy.filters.http.health_check
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                  pass_through_mode: false
                  headers:
                  - name: ":path"
                    exact_match: "/healthz"
                  - name: "x-envoy-livenessprobe"
                    exact_match: "healthz"
              - name: envoy.filters.http.grpc_web
              - name: envoy.filters.http.cors
              - name: envoy.filters.http.router
                typed_config: {}
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              require_client_certificate: false
              common_tls_context:
                tls_certificates:
                - certificate_chain:
                    filename: /etc/ssl/envoy/tls.crt
                  private_key:
                    filename: /etc/ssl/envoy/tls.key
      clusters:
      - name: my-app-cluster
        connect_timeout: 0.5s
        type: STRICT_DNS
        dns_lookup_family: V4_ONLY
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        load_assignment:
          cluster_name: my-app-cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: my-app.default.svc.cluster.local
                    port_value: 52001
        health_checks:
        - timeout: 1s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          grpc_health_check: {}
    admin:
      access_log_path: /dev/stdout
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 8090

I'm still not sure about specifying resource requests for the two containers in the Deployment, or about the number of replicas, but the solution works.
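
For what it's worth, Autopilot applies default resource requests to containers that don't declare any; if you'd rather be explicit, a request block on the my-app container might look like the sketch below (the values are placeholders to tune, not recommendations):

        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
            ephemeral-storage: "1Gi"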