Structured Logging with Apache Spark on Google Kubernetes Engine

Question

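I am running Apache Spark applications on a Google Kubernetes Engine cluster, which propagates any output from STDOUT and STDERR to Cloud Logging. However, granular log severity levels are not propagated. All messages in Cloud Logging have only INFO or ERROR severity (depending on whether they were written to stdout or stderr), and the actual severity level is hidden in a text property.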

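My goal is to format the messages in the Structured Logging JSON format so that the severity level is propagated to Cloud Logging. Unfortunately, Apache Spark still uses the deprecated log4j 1.x library for logging, and I would like to know how to format the log messages in a way that Cloud Logging can pick them up correctly.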

So far, I am using the following default log4j.properties file:

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Answer 1

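When Cloud Logging is enabled in a GKE cluster, logging is managed by GKE, so the log format cannot be changed as easily as on a GCE instance.

To push JSON-formatted logs in GKE, you can try the following options: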

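1. Make your software emit its logs in JSON format. Cloud Logging will detect the JSON-formatted log entries and ingest them as structured entries.
2. Manage your own version of fluentd as suggested here and set up your own parser. The solution will then be managed by you and no longer by GKE.
3. Add a sidecar container that reads your logs, converts them to JSON, and dumps the JSON to stdout. The logging agent in GKE will ingest the sidecar's logs as JSON.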

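Keep in mind that option three comes with some caveats: it can lead to significant resource consumption, and you will not be able to use kubectl logs, as explained here.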
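To make the conversion step in option three more concrete, here is a minimal Python sketch of what a sidecar could do with each line. It assumes the lines follow the ConversionPattern from the question, that the WARN and FATAL levels should map to the Cloud Logging severities WARNING and CRITICAL, and that a single-line JSON object with "severity" and "message" fields is what the logging agent picks up. The function names and the extra "logger" and "log4j_time" fields are made up for illustration; this is not a tested production setup.

```python
import json
import re
import sys

# Assumed mapping from log4j 1.x level names to Cloud Logging severity names.
LOG4J_TO_SEVERITY = {
    "TRACE": "DEBUG",
    "DEBUG": "DEBUG",
    "INFO": "INFO",
    "WARN": "WARNING",
    "ERROR": "ERROR",
    "FATAL": "CRITICAL",
}

# Matches lines written with the pattern "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n".
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[^\s:]+): "
    r"(?P<msg>.*)$"
)


def to_structured(line):
    """Convert one log4j text line into a single-line JSON log entry."""
    text = line.rstrip("\n")
    match = LINE_RE.match(text)
    if match is None:
        # Continuation lines such as stack traces do not match the pattern,
        # so they are passed through without a parsed level.
        return json.dumps({"severity": "DEFAULT", "message": text})
    return json.dumps({
        "severity": LOG4J_TO_SEVERITY.get(match.group("level"), "DEFAULT"),
        "message": match.group("msg"),
        # Illustrative extra fields; the names are hypothetical.
        "logger": match.group("logger"),
        "log4j_time": match.group("ts"),
    })


def convert_stream(lines, out):
    """Write one JSON entry per input line and flush after each one."""
    for line in lines:
        out.write(to_structured(line) + "\n")
        out.flush()
```

In a sidecar, this could be driven with convert_stream(sys.stdin, sys.stdout), with the input piped from something like tail -F on a log file that Spark writes to a volume shared between the two containers. The file location and the FileAppender configuration needed on the Spark side are not covered here. Multi-line stack traces end up as separate entries in this sketch, so grouping them would need extra logic.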

log4j

apache-spark

google-kubernetes-engine

google-cloud-logging

stackdriver