How can I configure Envoy proxy to fail over when some upstreams are unreachable?

I'm new to Envoy proxy. What I need is to use Envoy as a sidecar proxy between a gRPC client and its servers.

So far I have connected one gRPC client and two servers, with lb_policy set to ROUND_ROBIN. But when I shut down one of the servers, gRPC client calls fail.

So, how can I configure Envoy to handle this situation?

Here is my Envoy config:

admin:
  access_log_path: "/tmp/admin_access.log"
  address:
    socket_address:
      address: "10.19.17.188"
      port_value: 12000
static_resources:
  listeners:
    -
      name: "grpc-listener"
      address:
        socket_address:
          address: "10.19.17.188"
          port_value: 12001
      filter_chains:
        -
          filters:
            -
              name: "envoy.http_connection_manager"
              config:
                stat_prefix: "ingress"
                codec_type: "AUTO"
                route_config:
                  name: "grpc-route"
                  virtual_hosts:
                    -
                      name: "grpc-route"
                      domains:
                        - "*"
                      routes:
                        -
                          match:
                            prefix: "/"
                          route:
                            cluster: "grpc-service"
                http_filters:
                  -
                    name: "envoy.router"

  clusters:
      -
        name: "grpc-service"
        connect_timeout: "0.25s"
        type: "static"
        lb_policy: "ROUND_ROBIN"
        http2_protocol_options: {}
        hosts:
          -
            socket_address:
              address: "10.19.17.188"
              port_value: 12011
          -
            socket_address:
              address: "10.19.17.188"
              port_value: 12012

Error message from the Python gRPC client:

Traceback (most recent call last):
  File "greeter_client.py", line 40, in <module>
    run()
  File "greeter_client.py", line 32, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='%03d'%i))
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 565, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
    debug_error_string = "{"created":"@1568810074.217216860","description":"Error received from peer ipv4:10.19.17.188:12001","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: connection failure","grpc_status":14}"

Envoy log:

[2019-09-18 12:29:02.380][798][debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-09-18 12:29:02.380][798][debug][http] [source/common/http/conn_manager_impl.cc:1111] [C285][S679184262732628339] request end stream
[2019-09-18 12:29:02.380][798][debug][connection] [source/common/network/connection_impl.cc:561] [C286] delayed connection error: 111
[2019-09-18 12:29:02.380][798][debug][connection] [source/common/network/connection_impl.cc:190] [C286] closing socket: 0
[2019-09-18 12:29:02.380][798][debug][client] [source/common/http/codec_client.cc:82] [C286] disconnect. resetting 0 pending requests
[2019-09-18 12:29:02.380][798][debug][pool] [source/common/http/http2/conn_pool.cc:149] [C286] client disconnected
[2019-09-18 12:29:02.380][798][debug][router] [source/common/router/router.cc:868] [C285][S679184262732628339] upstream reset: reset reason connection failure
[2019-09-18 12:29:02.380][798][debug][http] [source/common/http/conn_manager_impl.cc:1186] [C285][S679184262732628339] Sending local reply with details upstream_reset_before_response_started{connection failure}
[2019-09-18 12:29:02.380][798][debug][http] [source/common/http/conn_manager_impl.cc:1378] [C285][S679184262732628339] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'
'date', 'Wed, 18 Sep 2019 12:29:02 GMT'
'server', 'envoy'

Outlier detection helps:

  clusters:
      -
        name: "grpc-service"
        connect_timeout: "0.25s"
        type: "static"
        lb_policy: "ROUND_ROBIN"
        http2_protocol_options: {}
        hosts:
          -
            socket_address:
              address: "10.19.17.188"
              port_value: 12011
          -
            socket_address:
              address: "10.19.17.188"
              port_value: 12012
        outlier_detection:      # this is where the magic happens
            consecutive_5xx: 1
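
Outlier detection is passive: a host is ejected only after real traffic to it has failed, so the request that triggers the ejection still returns an error, and later requests go to the remaining host. A slightly fuller sketch of the same block follows. The field names are from the Envoy v2 cluster API, but every value below is an illustrative assumption, not a recommendation:

```yaml
# Fragment to place under the "grpc-service" cluster.
# All values are assumptions chosen for illustration.
outlier_detection:
  consecutive_5xx: 1          # eject a host after one 5xx-class failure (connect failures count by default)
  interval: "5s"              # how often the ejection analysis runs
  base_ejection_time: "30s"   # base ejection duration; grows with repeated ejections of the same host
  max_ejection_percent: 50    # with two hosts, allow at most one to be ejected at a time
```

An ejected host is put back into rotation once its ejection time has elapsed, so a restarted server should start receiving traffic again without a config change.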

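You can also use one of the following approaches:

  1. Enable active health checking in Envoy for your cluster. For this you also have to expose a health check endpoint from your services, which should be fairly easy. See https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto

  2. Switch to Envoy dynamic configuration using additional components: a) a control plane such as go-control-plane, and b) a service discovery service such as consul. This will certainly add complexity to your service setup, but it also enables a more dynamic and robust solution.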
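
For option 1, a minimal sketch of an active health check on the cluster from the question might look like the following. It assumes the gRPC servers implement the standard grpc.health.v1.Health service, which the question does not show. The thresholds and timings are illustrative assumptions:

```yaml
# Fragment to place under the "grpc-service" cluster.
# Assumes each server exposes grpc.health.v1.Health; values are illustrative.
health_checks:
  - timeout: "1s"            # how long to wait for a health check response
    interval: "5s"           # time between health checks
    unhealthy_threshold: 1   # failed checks before a host is marked unhealthy
    healthy_threshold: 1     # successful checks before a host is marked healthy again
    grpc_health_check: {}    # use the gRPC health checking protocol
```

With this in place, a stopped server should be taken out of load balancing after a failed check rather than after a failed client request, at the cost of a detection delay of up to one interval.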