运行 Azure Linux 上的 Ops Agent

Run Ops Agent on Linux in Azure

我希望能够使用 Google 云日志和监控来监控(日志、性能指标)Azure(和其他云)中的 VM。

作为概念验证,

当我检查 Ops Agent 的状态时,我看到以下内容(略微编辑)

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
    Process: 2730195 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
    Process: 2730208 ExecStart=/opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=${RUNTIME_DIRECTORY}/otel.yaml (code=exited, status=1/FAILURE)
   Main PID: 2730208 (code=exited, status=1/FAILURE)

Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Metrics Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Metrics Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
    Process: 2730194 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
    Process: 2730207 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (co>
   Main PID: 2730207 (code=exited, status=255/EXCEPTION)

Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

● google-cloud-ops-agent.service - Google Cloud Ops Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-02-16 22:39:21 UTC; 1min 7s ago
    Process: 2730090 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
    Process: 2730102 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 2730102 (code=exited, status=0/SUCCESS)

Feb 16 22:39:21 HOSTNAME systemd[1]: Starting Google Cloud Ops Agent...
Feb 16 22:39:21 HOSTNAME systemd[1]: Finished Google Cloud Ops Agent.

Ops Agent 日志显示

[2022/02/16 22:39:22] [ info] [engine] started (pid=2730207)
[2022/02/16 22:39:22] [ info] [storage] version=1.1.5, initializing...
[2022/02/16 22:39:22] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2022/02/16 22:39:22] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2022/02/16 22:39:22] [ info] [storage] backlog input plugin: storage_backlog.2
[2022/02/16 22:39:22] [ info] [cmetrics] version=0.2.2
[2022/02/16 22:39:22] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2022/02/16 22:39:22] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch token from the metadata server
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] token retrieval failed
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch project id from the metadata server
[2022/02/16 22:39:22] [error] [output] failed to initialize 'stackdriver' plugin
[2022/02/16 22:39:22] [ info] [input] pausing fluentbit_metrics.0
[2022/02/16 22:39:22] [ info] [input] pausing tail.1
[2022/02/16 22:39:22] [ info] [input] pausing storage_backlog.2

我注意到 private_key is not defined, fetching it from metadata server,这表明密钥文件未被提取。

文档说 Ops Agent 是从 Compute Engine 实例收集遥测数据的主要代理。参见 here

Ops Agent 能否仅在 Compute Engine 实例上 运行,或者如果配置正确,它可以在任何地方 运行 是否合理?

Ops Agent 正在寻找凭据但没有找到。

这意味着您没有使用正确的文件访问权限将服务帐户复制到正确的位置,或者您没有使用正确的文件访问权限正确设置环境变量GOOGLE_APPLICATION_CREDENTIALS。

代理然后检查不支持 Google OAuth 访问令牌的元数据服务(如果设置,Azure 会提供 MSI 凭据)

启动google-cloud-ops-agent.service时,启动google-cloud-ops-agent-fluent-bit.servicegoogle-cloud-ops-agent-opentelemetry-collector.service 然后退出。作为覆盖添加到 google-cloud-ops-agent.service 的环境变量不会保留给其他变量。

我发现我必须将 GOOGLE_APPLICATION_CREDENTIALS 添加到 google-cloud-ops-agent-opentelemetry-collector.serviceGOOGLE_SERVICE_CREDENTIALS to google-cloud-ops-agent-fluent-bit.service. You can override the systemd units non-interactively:

SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-fluent-bit.service <<'EOF'
[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF

SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-opentelemetry-collector.service <<'EOF'
[Service]
Environment='GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF