看似无辜的 eks worker ami 升级后，Kong k8s 部署失败

Question

AWS AMI 工作人员升级到新版本后，我们在 k8s 上的 kong 部署失败。
香港版本：1.4
旧的 ami 版本：amazon-eks-node-1.14-v20200423
新的 ami 版本：amazon-eks-node-1.14-v20200723
kubernetes 版本：1.14

我看到新的 AMI 附带了一个新的 docker 版本：19.03.06，而旧版本附带了 18.09.09。这会导致问题吗？

我可以在 kong pod 日志中看到很多 signal 9 退出：

2020/08/11 09:00:48 [notice] 1#0: using the "epoll" event method
2020/08/11 09:00:48 [notice] 1#0: openresty/1.15.8.2
2020/08/11 09:00:48 [notice] 1#0: built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
2020/08/11 09:00:48 [notice] 1#0: OS: Linux 4.14.181-140.257.amzn2.x86_64
2020/08/11 09:00:48 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/11 09:00:48 [notice] 1#0: start worker processes
2020/08/11 09:00:48 [notice] 1#0: start worker process 38
2020/08/11 09:00:48 [notice] 1#0: start worker process 39
2020/08/11 09:00:48 [notice] 1#0: start worker process 40
2020/08/11 09:00:48 [notice] 1#0: start worker process 41
2020/08/11 09:00:50 [notice] 1#0: signal 17 (SIGCHLD) received from 40
2020/08/11 09:00:50 [alert] 1#0: worker process 40 exited on signal 9
2020/08/11 09:00:50 [notice] 1#0: start worker process 42
2020/08/11 09:00:51 [notice] 1#0: signal 17 (SIGCHLD) received from 39
2020/08/11 09:00:51 [alert] 1#0: worker process 39 exited on signal 9
2020/08/11 09:00:51 [notice] 1#0: start worker process 43
2020/08/11 09:00:52 [notice] 1#0: signal 17 (SIGCHLD) received from 41
2020/08/11 09:00:52 [alert] 1#0: worker process 41 exited on signal 9
2020/08/11 09:00:52 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:52 [notice] 1#0: start worker process 44
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:269: randomseed(): random seed: 255136921215 for worker nb 0
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=started, pid=38, data=nil
2020/08/11 09:00:48 [notice] 38#0: *1 [lua] cache_warmup.lua:42: cache_warmup_single_entity(): Preloading 'services' into the cache ..., context: init_worker_by_lua*
2020/08/11 09:00:48 [warn] 38#0: *1 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/08/11 09:00:53 [notice] 1#0: signal 17 (SIGCHLD) received from 38
2020/08/11 09:00:53 [alert] 1#0: worker process 38 exited on signal 9
2020/08/11 09:00:53 [notice] 1#0: start worker process 45
2020/08/11 09:00:54 [notice] 1#0: signal 17 (SIGCHLD) received from 42
2020/08/11 09:00:54 [alert] 1#0: worker process 42 exited on signal 9
2020/08/11 09:00:54 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:54 [notice] 1#0: start worker process 46
2020/08/11 09:00:55 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:55 [notice] 1#0: signal 17 (SIGCHLD) received from 43
2020/08/11 09:00:55 [alert] 1#0: worker process 43 exited on signal 9
2020/08/11 09:00:55 [notice] 1#0: start worker process 47
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 44
2020/08/11 09:00:56 [alert] 1#0: worker process 44 exited on signal 9
2020/08/11 09:00:56 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:56 [notice] 1#0: start worker process 48
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 45
2020/08/11 09:00:56 [alert] 1#0: worker process 45 exited on signal 9
2020/08/11 09:00:58 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:58 [notice] 1#0: start worker process 49
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 46
2020/08/11 09:00:59 [alert] 1#0: worker process 46 exited on signal 9
2020/08/11 09:00:59 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:59 [notice] 1#0: start worker process 50
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 47

唯一的关键信息是：
[crit] 235#0: *45 [lua] balancer.lua:749: init(): failed loading initial list of upstreams: failed to get from node cache: could not acquire callback lock: timeout, context: ngx.timer

看着kubectl describe pod kong...我明白OOMKilled 这可能是内存问题吗？

Answer 1

新节点 ami ulimit (nofile) 已更改为 1048576，与 65536 相比有很大变化，导致我们当前的 Kong 设置出现内存问题，因此无法部署。

将新节点文件限制更改为以前的值修复了 kong 部署。
虽然我们决定改为增加 Kong 内存请求，但这也解决了这个问题。

看似无辜的 eks worker ami 升级后，Kong k8s 部署失败

Kong k8s deployment fails after seemingly innocent eks worker ami upgrade

out-of-memory

ulimit

kong

amazon-eks

amazon-ami