Envoy Circuit Breaking Non-Deterministic Behaviour
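Our experiments with Envoy's circuit breaking show that the results are not deterministic. We demonstrated this by deliberately trying to trip the circuit with the following setup:
The service is a simple web server that returns a 200 after a 2-second delay (the delay ensures the server stays busy between the asynchronous requests). A snapshot of our Envoy sidecar config shows that circuit breaking is enabled (over HTTP/1.1), with a maximum of 1 connection and 1 pending request: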
circuit_breakers:
  thresholds:
  - priority: "DEFAULT"
    max_connections: 1
    max_pending_requests: 1
Next, we check that it works by sending a single request to the service, and it reliably responds with a 200 as expected.
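However, if we now send 2 asynchronous requests to the service, we see unexpected results. Sometimes both requests return 200, which should not be possible, because the second request should trip the circuit breaker. At other times, one request returns 200 and the other returns 503 Service Unavailable, which is what we expect to happen. Despite our best efforts, we could not achieve any kind of repeatability, which leads us to think this is related to Envoy's underlying concurrency.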
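When we change max_connections and max_pending_requests to larger numbers (>100) and again send too many requests in an attempt to trip the circuit, the inconsistency persists. The number of requests allowed through is roughly right, but it is sometimes off by a few.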
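To put numbers on "roughly right, but sometimes off", it may help to repeat the concurrent batch several times and count which status patterns come back. The following is only a sketch using the Python standard library. The helper names (fetch_status, tally_statuses, repeat_trials) are made up for illustration, and the URL is assumed to be the same one used in the curl runs in this post.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

# Assumption: same endpoint as the curl runs in this post.
URL = "http://localhost:3000/?status=200&sleep=2"


def fetch_status(url=URL, timeout=30):
    """Return the HTTP status code of one GET, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on a 503, but the status code is what we want to count.
        return err.code


def tally_statuses(fetch, concurrency):
    """Call `fetch` `concurrency` times in parallel and count the results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(fetch) for _ in range(concurrency)]
        return Counter(f.result() for f in futures)


def repeat_trials(fetch, concurrency, trials):
    """Run several batches and count how often each outcome pattern occurs."""
    outcomes = Counter()
    for _ in range(trials):
        batch = tally_statuses(fetch, concurrency)
        # A pattern is a sorted tuple of (status, count) pairs for one batch.
        outcomes[tuple(sorted(batch.items()))] += 1
    return outcomes
```

Calling repeat_trials(fetch_status, concurrency=3, trials=10) against the running sidecar would return a Counter keyed by pattern, for example one key for "one 200 and two 503s" and another for "three 200s", so the spread between runs becomes visible without reading interleaved curl output.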
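We would like to understand the reason for this lack of absolute determinism. Any help is much appreciated! The code is in the repo.
Edit: There is an issue that describes similar unexpected behaviour in detail, but I am still far from finding a solution.
I have included logs from two runs to demonstrate the output:
- 3 requests sent simultaneously, 1 gets through.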
❯ (printf '%s\n' {1..3}) | xargs -I % -P 20 curl -v "http://localhost:3000?status=200&sleep=2"
** Trying ::1...
Trying ::1...
** TCP_NODELAY set
TCP_NODELAY set
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 3000 (#0)
* Connected to localhost (::1) port 3000 (#0)
> GET /?status=200&sleep=2 HTTP/1.1
>> GET /?status=200&sleep=2 HTTP/1.1
Host: localhost:3000
>> Host: localhost:3000
User-Agent: curl/7.64.1
>> User-Agent: curl/7.64.1
Accept: */*
>> Accept: */*
>
* Connected to localhost (::1) port 3000 (#0)
> GET /?status=200&sleep=2 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 81
< content-type: text/plain
< x-envoy-overloaded: true
< date: Wed, 12 Feb 2020 03:36:29 GMT
< server: envoy
<
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: overflow* Closing connection 0
< HTTP/1.1 503 Service Unavailable
< content-length: 81
< content-type: text/plain
< x-envoy-overloaded: true
< date: Wed, 12 Feb 2020 03:36:29 GMT
< server: envoy
<
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: overflow* Closing connection 0
< HTTP/1.1 200 OK
< content-type: text/html; charset=utf-8
< content-length: 3
< server: envoy
< date: Wed, 12 Feb 2020 03:36:31 GMT
< x-envoy-upstream-service-time: 2007
<
* Connection #0 to host localhost left intact
200* Closing connection 0
- 3 requests sent simultaneously, all return 200.
❯ (printf '%s\n' {1..3}) | xargs -I % -P 20 curl -v "http://localhost:3000?status=200&sleep=2"
** Trying ::1...
Trying ::1...
** TCP_NODELAY set
TCP_NODELAY set
* * Trying ::1...
*Connected to localhost (::1) port 3000 (#0)
* TCP_NODELAY set
Connected to localhost (::1) port 3000 (#0)
> GET /?status=200&sleep=2 HTTP/1.1
> >Host: localhost:3000
>GET /?status=200&sleep=2 HTTP/1.1
User-Agent: curl/7.64.1
>> Accept: */*
Host: localhost:3000
> >
User-Agent: curl/7.64.1
> Accept: */*
>
* Connected to localhost (::1) port 3000 (#0)
> GET /?status=200&sleep=2 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: text/html; charset=utf-8
< content-length: 3
< server: envoy
< date: Wed, 12 Feb 2020 03:40:50 GMT
< x-envoy-upstream-service-time: 2006
<
* Connection #0 to host localhost left intact
200* Closing connection 0
< HTTP/1.1 200 OK
< content-type: text/html; charset=utf-8
< content-length: 3
< server: envoy
< date: Wed, 12 Feb 2020 03:40:52 GMT
< x-envoy-upstream-service-time: 4011
<
* Connection #0 to host localhost left intact
200* Closing connection 0
< HTTP/1.1 200 OK
< content-type: text/html; charset=utf-8
< content-length: 3
< server: envoy
< date: Wed, 12 Feb 2020 03:40:54 GMT
< x-envoy-upstream-service-time: 6015
<
* Connection #0 to host localhost left intact
200* Closing connection 0
From one of the contributors here:
The circuit breakers are intended to prevent too much load from propagating through the system, not enforce a strict limit. The system is implemented in a way that is simpler and more performant, but can slightly exceed the limits in some cases. Here's a comment from the implementation of the circuit breaker limit tracking
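One way a limit that is checked and then incremented in two separate steps can be slightly exceeded is sketched below. This is a toy model and an assumption on my part, not Envoy's code and not the comment referred to above. LooseLimit, replay, and the two schedules are hypothetical names used only to show the effect.

```python
class LooseLimit:
    """Toy counter whose check and increment are separate, non-atomic steps."""

    def __init__(self, maximum):
        self.maximum = maximum
        self.active = 0

    def can_create(self):
        return self.active < self.maximum

    def inc(self):
        self.active += 1


def replay(limit, schedule):
    """Replay ("check", worker) / ("inc", worker) steps in the given order.

    A worker only increments if its own earlier check passed.
    Returns the workers that were admitted, in admission order.
    """
    passed_check = set()
    admitted = []
    for step, worker in schedule:
        if step == "check":
            if limit.can_create():
                passed_check.add(worker)
        elif step == "inc" and worker in passed_check:
            limit.inc()
            admitted.append(worker)
    return admitted


# A finishes check-and-increment before B starts: the limit holds.
SERIALIZED = [("check", "A"), ("inc", "A"), ("check", "B"), ("inc", "B")]

# Both check before either increments: both see room and both get in.
INTERLEAVED = [("check", "A"), ("check", "B"), ("inc", "A"), ("inc", "B")]
```

With a limit of 1, replaying SERIALIZED admits only A, while replaying INTERLEAVED admits both A and B and leaves the counter at 2. Which ordering occurs in a real proxy depends on timing, which would be consistent with the lack of repeatability described in the question.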