stackdriver-metadata-agent-cluster-level gets OOMKilled

Question

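I updated a GKE cluster from 1.13 to 1.15.9-gke.12. In the process, I switched from legacy logging to Stackdriver Kubernetes Engine Monitoring. Now the stackdriver-metadata-agent-cluster-level pod keeps restarting because it gets OOMKilled.

The memory usage seems fine, though.

The logs also look fine (identical to those of a newly created cluster):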

I0305 08:32:33.436613       1 log_spam.go:42] Command line arguments:
I0305 08:32:33.436726       1 log_spam.go:44]  argv[0]: '/k8s_metadata'
I0305 08:32:33.436753       1 log_spam.go:44]  argv[1]: '-logtostderr'
I0305 08:32:33.436779       1 log_spam.go:44]  argv[2]: '-v=1'
I0305 08:32:33.436818       1 log_spam.go:46] Process id 1
I0305 08:32:33.436859       1 log_spam.go:50] Current working directory /
I0305 08:32:33.436901       1 log_spam.go:52] Built on Jun 27 20:15:21 (1561666521)
 at gcm-agent-dev-releaser@ikle14.prod.google.com:/google/src/files/255462966/depot/branches/gcm_k8s_metadata_release_branch/255450506.1/OVERLAY_READONLY/google3
 as //cloud/monitoring/agents/k8s_metadata:k8s_metadata
 with gc go1.12.5 for linux/amd64
 from changelist 255462966 with baseline 255450506 in a mint client based on //depot/branches/gcm_k8s_metadata_release_branch/255450506.1/google3
Build label: gcm_k8s_metadata_20190627a_RC00
Build tool: Blaze, release blaze-2019.06.17-2 (mainline @253503028)
Build target: //cloud/monitoring/agents/k8s_metadata:k8s_metadata
I0305 08:32:33.437188       1 trace.go:784] Starting tracingd dapper tracing
I0305 08:32:33.437315       1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
W0305 08:32:33.536093       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0305 08:32:33.936066       1 main.go:134] Initiating watch for { v1 nodes} resources
I0305 08:32:33.936169       1 main.go:134] Initiating watch for { v1 pods} resources
I0305 08:32:33.936231       1 main.go:134] Initiating watch for {batch v1beta1 cronjobs} resources
I0305 08:32:33.936297       1 main.go:134] Initiating watch for {apps v1 daemonsets} resources
I0305 08:32:33.936361       1 main.go:134] Initiating watch for {extensions v1beta1 daemonsets} resources
I0305 08:32:33.936420       1 main.go:134] Initiating watch for {apps v1 deployments} resources
I0305 08:32:33.936489       1 main.go:134] Initiating watch for {extensions v1beta1 deployments} resources
I0305 08:32:33.936552       1 main.go:134] Initiating watch for { v1 endpoints} resources
I0305 08:32:33.936627       1 main.go:134] Initiating watch for {extensions v1beta1 ingresses} resources
I0305 08:32:33.936698       1 main.go:134] Initiating watch for {batch v1 jobs} resources
I0305 08:32:33.936777       1 main.go:134] Initiating watch for { v1 namespaces} resources
I0305 08:32:33.936841       1 main.go:134] Initiating watch for {apps v1 replicasets} resources
I0305 08:32:33.936897       1 main.go:134] Initiating watch for {extensions v1beta1 replicasets} resources
I0305 08:32:33.936986       1 main.go:134] Initiating watch for { v1 replicationcontrollers} resources
I0305 08:32:33.937067       1 main.go:134] Initiating watch for { v1 services} resources
I0305 08:32:33.937135       1 main.go:134] Initiating watch for {apps v1 statefulsets} resources
I0305 08:32:33.937157       1 main.go:142] All resources are being watched, agent has started successfully
I0305 08:32:33.937168       1 main.go:145] No statusz port provided; not starting a server
I0305 08:32:37.134913       1 binarylog.go:95] Starting disk-based binary logging
I0305 08:32:37.134965       1 binarylog.go:265] rpc: flushed binary log to ""

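I have already tried disabling logging and re-enabling it, without success. The pod keeps restarting, roughly once a minute.

Has anyone else run into this?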

Answer 1

The cause is that the memory limit set on the metadata-agent deployment is too low, so the pod is OOM-killed: it needs more memory than the limit allows in order to work properly.
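To confirm this diagnosis on your own cluster, something like the following sketch may help. It prints, for each container in the pod, the reason the previous instance terminated and the memory limit currently applied. The function name pod_oom_report is hypothetical, and the pod name has to be looked up first with kubectl get pods -n kube-system.

```shell
# Sketch: show last termination reason and memory limit per container.
# pod_oom_report is a hypothetical helper name, not part of kubectl.
pod_oom_report() {
  # $1 = pod name, e.g. taken from `kubectl get pods -n kube-system`
  kubectl get pod "$1" -n kube-system \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
  kubectl get pod "$1" -n kube-system \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

If the first command reports OOMKilled for a container and the second shows a very small memory limit for it, the limit is the likely culprit.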

There is a workaround until this is fixed.

You can override the base resources in the metadata-agent's ConfigMap:

kubectl edit cm -n kube-system metadata-agent-config

Setting baseMemory: 50Mi should be enough; if it isn't, use a higher value such as 100Mi or 200Mi.

The metadata-agent-config ConfigMap should then look something like this:

apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseMemory: 50Mi
kind: ConfigMap
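If you would rather not edit the ConfigMap interactively, the same change can probably be applied with kubectl patch. This is only a sketch: the helper name set_agent_base_memory is hypothetical, and it assumes the ConfigMap holds nothing else under the NannyConfiguration key that you want to keep, since the patch replaces that key's whole value.

```shell
# Sketch: set baseMemory in metadata-agent-config without an editor.
# set_agent_base_memory is a hypothetical helper name.
set_agent_base_memory() {
  # $1 = memory value, e.g. 50Mi, 100Mi or 200Mi
  # Build a JSON merge patch; \\n in the printf format emits a literal \n,
  # which JSON reads as a newline inside the string value.
  patch=$(printf '{"data":{"NannyConfiguration":"apiVersion: nannyconfig/v1alpha1\\nkind: NannyConfiguration\\nbaseMemory: %s"}}' "$1")
  kubectl patch configmap metadata-agent-config -n kube-system --type merge -p "$patch"
}
```

For example, set_agent_base_memory 100Mi would write the configuration shown above with 100Mi in place of 50Mi.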

Also note that you need to restart the deployment, because the ConfigMap change is not picked up automatically:

kubectl delete deployment -n kube-system stackdriver-metadata-agent-cluster-level
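Afterwards it is worth checking that the new value took effect. The sketch below assumes GKE recreates the deployment after it is deleted (that is an assumption here, not something verified in this answer); the helper name agent_status is hypothetical. It prints the memory limit on the recreated deployment's containers and then lists the kube-system pods so you can watch the RESTARTS column.

```shell
# Sketch: check the memory limit on the deployment and look at pod restarts.
# agent_status is a hypothetical helper name.
agent_status() {
  kubectl get deployment stackdriver-metadata-agent-cluster-level -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
  # The RESTARTS column should stop increasing once the limit is high enough.
  kubectl get pods -n kube-system
}
```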

For more details, see the addon-resizer documentation.

Answer 2

I was about to open a support ticket with GCP, but they have posted this notice:

Description We are experiencing issue with Fluentd crashlooping in Google Kubernetes Engine where master version is 1.14 or 1.15, when gVisor is enabled. The fix is targeted for a release aiming to begin on 17 April 2020. We will provide more updates as the date gets closer. We will provide an update by Thursday, 2020-04-09 14:30 US/Pacific with current details. We apologize to all who are affected by the disruption.

Start time April 2, 2020 at 10:58:24 AM GMT-7

End time

Steps to reproduce Fluentd crashloops in GKE clusters could lead to missing logs.

Workaround Upgrade Google Kubernetes Engine cluster masters to version 1.16+.

Affected products Other
