Kubernetes sends traffic to the pod even after sending SIGTERM
I have a Spring Boot application configured for graceful shutdown, deployed on k8s 1.12.7.
Here are the logs:
2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-20 10:23:16.180 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
2019-07-20 10:23:16.273 INFO [service,fd964ebaa631a860,75a07c123397e4ff,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=59
2019-07-20 10:23:16.374 INFO [service,9a569ecd8c448e98,00bc11ef2776d7fb,false] 1 --- [nio-8080-exec-1] com.jay.resource.ProductResource : GET /products?id=68
...
2019-07-20 10:23:33.711 INFO [service,1532d6298acce718,08cfb8085553b02e,false] 1 --- [nio-8080-exec-9] com.jay.resource.ProductResource : GET /products?id=209
2019-07-20 10:23:46.181 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-20 10:23:46.216 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
The application received the SIGTERM from Kubernetes at 10:23:16.180. According to Termination of Pods, point #5, the terminating pod should be removed from the Service's endpoints list, yet it kept receiving requests for 17 seconds after the SIGTERM signal (until 10:23:33.711), which contradicts that. Am I missing some configuration?
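One way to observe point #5 directly is to watch the Service's endpoints object while the pod shuts down and compare the removal time with the SIGTERM timestamp in the application logs (the Service name here is illustrative):

# Illustrative check: watch the endpoints list during termination to see
# exactly when the pod IP is removed relative to the SIGTERM in the app logs.
kubectl get endpoints service --watch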
Dockerfile
FROM openjdk:8-jre-slim
MAINTAINER Jay
RUN apt update && apt install -y curl libtcnative-1 gcc && apt clean
ADD build/libs/sample-service.jar /
# exec-form CMD: the JVM runs as PID 1 and receives the SIGTERM directly
CMD ["java", "-jar", "sample-service.jar"]
GracefulShutdown
// https://github.com/spring-projects/spring-boot/issues/4657
import org.apache.catalina.connector.Connector
import org.slf4j.LoggerFactory
import org.springframework.boot.web.embedded.tomcat.TomcatConnectorCustomizer
import org.springframework.context.ApplicationListener
import org.springframework.context.event.ContextClosedEvent
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

class GracefulShutdown(val waitTime: Long, val timeout: Long) :
        TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {

    @Volatile
    private var connector: Connector? = null

    override fun customize(connector: Connector) {
        this.connector = connector
    }

    override fun onApplicationEvent(event: ContextClosedEvent) {
        log.info("Received shutdown event")
        val executor = this.connector?.protocolHandler?.executor
        if (executor is ThreadPoolExecutor) {
            try {
                log.info("Waiting for ${waitTime}s to finish")
                // Note: the connector keeps accepting requests during this wait;
                // it is only paused afterwards.
                hibernate(waitTime * 1000)
                log.info("Resumed after hibernation")
                this.connector?.pause()
                executor.shutdown()
                if (!executor.awaitTermination(timeout, TimeUnit.SECONDS)) {
                    log.warn("Tomcat thread pool did not shut down gracefully within $timeout seconds. Proceeding with forceful shutdown")
                    executor.shutdownNow()
                    if (!executor.awaitTermination(timeout, TimeUnit.SECONDS)) {
                        log.error("Tomcat thread pool did not terminate")
                    }
                }
            } catch (ex: InterruptedException) {
                log.info("Interrupted")
                Thread.currentThread().interrupt()
            }
        } else {
            this.connector?.pause()
        }
    }

    private fun hibernate(time: Long) {
        try {
            Thread.sleep(time)
        } catch (ex: Exception) {
            // ignore and resume shutdown
        }
    }

    companion object {
        private val log = LoggerFactory.getLogger(GracefulShutdown::class.java)
    }
}
import org.slf4j.LoggerFactory
import org.springframework.beans.factory.annotation.Value
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory
import org.springframework.boot.web.servlet.server.ConfigurableServletWebServerFactory
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class GracefulShutdownConfig(
        // "\${...}" escapes Kotlin's string templates so Spring sees the placeholder
        @Value("\${app.shutdown.graceful.wait-time:30}") val waitTime: Long,
        @Value("\${app.shutdown.graceful.timeout:30}") val timeout: Long) {

    companion object {
        private val log = LoggerFactory.getLogger(GracefulShutdownConfig::class.java)
    }

    @Bean
    fun gracefulShutdown(): GracefulShutdown {
        return GracefulShutdown(waitTime, timeout)
    }

    @Bean
    fun webServerFactory(gracefulShutdown: GracefulShutdown): ConfigurableServletWebServerFactory {
        log.info("GracefulShutdown configured with wait: ${waitTime}s and timeout: ${timeout}s")
        val factory = TomcatServletWebServerFactory()
        factory.addConnectorCustomizers(gracefulShutdown)
        return factory
    }
}
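The two placeholders above resolve from configuration; a minimal application.yml sketch (the values shown are just the defaults made explicit):

app:
  shutdown:
    graceful:
      wait-time: 30
      timeout: 30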
deployment file
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: service
  name: service
spec:
  progressDeadlineSeconds: 420
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      k8s-app: service
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: service
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - env:
        - name: SPRING_PROFILES_ACTIVE
          value: dev
        image: service:2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 20
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 5
        name: service
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 60
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 100
          periodSeconds: 10
          timeoutSeconds: 5
Update:
Added a custom health check endpoint:
import org.slf4j.LoggerFactory
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.actuate.endpoint.web.annotation.RestControllerEndpoint
import org.springframework.http.HttpStatus
import org.springframework.http.ResponseEntity
import org.springframework.stereotype.Component
import org.springframework.web.bind.annotation.GetMapping

@RestControllerEndpoint(id = "live")
@Component
class LiveEndpoint {

    companion object {
        private val log = LoggerFactory.getLogger(LiveEndpoint::class.java)
    }

    @Autowired
    private lateinit var gracefulShutdownStatus: GracefulShutdownStatus

    @GetMapping
    fun live(): ResponseEntity<Any> {
        val status = if (gracefulShutdownStatus.isTerminating())
            HttpStatus.INTERNAL_SERVER_ERROR.value()
        else
            HttpStatus.OK.value()
        log.info("Status: $status")
        return ResponseEntity.status(status).build()
    }
}
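GracefulShutdownStatus is referenced but never shown in the post; a minimal sketch of what it could look like, assuming it is just a thread-safe terminating flag that GracefulShutdown flips when the shutdown event arrives (everything beyond the names already in the post is an assumption):

import org.springframework.stereotype.Component
import java.util.concurrent.atomic.AtomicBoolean

@Component
class GracefulShutdownStatus {
    // flips to true when GracefulShutdown receives the ContextClosedEvent
    private val terminating = AtomicBoolean(false)

    fun markTerminating() = terminating.set(true)

    fun isTerminating(): Boolean = terminating.get()
}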
Changed the livenessProbe:
livenessProbe:
  httpGet:
    path: /actuator/live
    port: 8080
  initialDelaySeconds: 100
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3
Here are the logs after the change:
2019-07-21 14:13:01.431 INFO [service,9b65b26907f2cf8f,9b65b26907f2cf8f,false] 1 --- [nio-8080-exec-2] com.jay.util.LiveEndpoint : Status: 200
2019-07-21 14:13:01.444 INFO [service,3da259976f9c286c,64b0d5973fddd577,false] 1 --- [nio-8080-exec-3] com.jay.resource.ProductResource : GET /products?id=52
2019-07-21 14:13:01.609 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Received shutdown event
2019-07-21 14:13:01.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Waiting for 30s to finish
...
2019-07-21 14:13:06.431 INFO [service,002c0da2133cf3b0,002c0da2133cf3b0,false] 1 --- [nio-8080-exec-3] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:06.433 INFO [service,072abbd7275103ce,d1ead06b4abf2a34,false] 1 --- [nio-8080-exec-4] com.jay.resource.ProductResource : GET /products?id=96
...
2019-07-21 14:13:11.431 INFO [service,35aa09a8aea64ae6,35aa09a8aea64ae6,false] 1 --- [io-8080-exec-10] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:11.508 INFO [service,a78c924f75538a50,0314f77f21076313,false] 1 --- [nio-8080-exec-2] com.jay.resource.ProductResource : GET /products?id=110
...
2019-07-21 14:13:16.431 INFO [service,38a940dfda03956b,38a940dfda03956b,false] 1 --- [nio-8080-exec-9] com.jay.util.LiveEndpoint : Status: 500
2019-07-21 14:13:16.593 INFO [service,d76e81012934805f,b61cb062154bb7f0,false] 1 --- [io-8080-exec-10] com.jay.resource.ProductResource : GET /products?id=152
...
2019-07-21 14:13:29.634 INFO [service,38a32a20358a7cc4,2029de1ed90e9539,false] 1 --- [nio-8080-exec-6] com.jay.resource.ProductResource : GET /products?id=191
2019-07-21 14:13:31.610 INFO [service,,,] 1 --- [ Thread-7] com.jay.util.GracefulShutdown : Resumed after hibernation
2019-07-21 14:13:31.692 INFO [service,,,] 1 --- [ Thread-7] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
Even with the livenessProbe's 3 failures, Kubernetes kept serving traffic to the pod for 13 seconds after the liveness failures, i.e. from 14:13:16.431 to 14:13:29.634.
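The kubelet also records each probe failure as a pod event, which gives another way to cross-check these timestamps (the pod name here is illustrative):

kubectl describe pod service-xxxxxxxxx-xxxxx
# look for "Unhealthy: Liveness probe failed" events and compare their
# timestamps with the request log entries above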
Update 2:
The sequence of events (thanks to Eamonn McEvoy):
seconds | healthy | events
0 | ✔ | * liveness probe healthy
1 | ✔ | - SIGTERM
2 | ✔ |
3 | ✔ |
4 | ✔ |
5 | ✔ | * liveness probe unhealthy (1/3)
6 | ✔ |
7 | ✔ |
8 | ✔ |
9 | ✔ |
10 | ✔ | * liveness probe unhealthy (2/3)
11 | ✔ |
12 | ✔ |
13 | ✔ |
14 | ✔ |
15 | ✘ | * liveness probe unhealthy (3/3)
.. | ✔ | * traffic is served
28 | ✔ | * traffic is served
29 | ✘ | * pod restarts
SIGTERM doesn't immediately put the pod into the terminating state. You can see in the logs that your application starts its graceful shutdown at 10:23:16.180 and takes >20 seconds to complete. Only at that point does the container stop and the pod can enter the terminating state.
As far as Kubernetes is concerned, the pod looks fine during the graceful shutdown period. You need to add a liveness probe to your deployment; when it becomes unhealthy, traffic will stop.
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 100
  periodSeconds: 10
  timeoutSeconds: 5
Update:
This is because your failure threshold is 3, so you are allowing up to 15 seconds of traffic after the SIGTERM;
For example:
seconds | healthy | events
0 | ✔ | * liveness probe healthy
1 | ✔ | - SIGTERM
2 | ✔ |
3 | ✔ |
4 | ✔ |
5 | ✔ | * liveness probe issued
6 | ✔ | .
7 | ✔ | .
8 | ✔ | .
9 | ✔ | .
10 | ✔ | * liveness probe timeout - unhealthy (1/3)
11 | ✔ |
12 | ✔ |
13 | ✔ |
14 | ✔ |
15 | ✔ | * liveness probe issued
16 | ✔ | .
17 | ✔ | .
18 | ✔ | .
19 | ✔ | .
20 | ✔ | * liveness probe timeout - unhealthy (2/3)
21 | ✔ |
22 | ✔ |
23 | ✔ |
24 | ✔ |
25 | ✔ | * liveness probe issued
26 | ✔ | .
27 | ✔ | .
28 | ✔ | .
29 | ✔ | .
30 | ✘ | * liveness probe timeout - unhealthy (3/3)
| | * pod restarts
This assumes the endpoint returns an unhealthy response during the graceful shutdown. Since you have timeoutSeconds: 5, if the probe simply times out instead, this will take even longer: there is a 5 second delay between the liveness probe request being issued and its response being received. It could be that the container actually dies before the liveness threshold is reached and you are still seeing the original behaviour.
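Following the table above, a rough worst-case bound for the timeout case looks like this (a sketch under the answer's assumptions, not measured data):

probes issued at t = 5, 15, 25, each timing out after 5s
third failure registered at t = 25 + 5 = 30
=> up to ~30 seconds of traffic after SIGTERM before the pod restarts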