Alert if a docker container stops

Question

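I am monitoring multiple containers with Prometheus, cAdvisor and Prometheus Alertmanager. What I want is to get an alert when a container goes down for some reason. The problem is that if a container dies, cAdvisor no longer collects any metrics for it. Any query returns 'no data' because there is nothing for the query to match.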

Answer 1

Take a look at the Prometheus function absent():

absent(v instant-vector) returns an empty vector if the vector passed to it has any elements and a 1-element vector with the value 1 if the vector passed to it has no elements.

This is useful for alerting on when no time series exist for a given metric name and label combination.

Examples:

absent(nonexistent{job="myjob"}) => {job="myjob"}
absent(nonexistent{job="myjob",instance=~".*"}) => {job="myjob"}
absent(sum(nonexistent{job="myjob"})) => {}

Here is an example alert (written in the Prometheus 1.x rule syntax):

ALERT kibana_absent
  IF absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Service {{ $labels.com_docker_compose_service }} down",
    description = "Service {{ $labels.com_docker_compose_service }} has been down for more than 5 seconds."
  }
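Prometheus 2.0 and later read alerting rules from YAML rule files instead of the ALERT syntax. As a rough sketch, the same rule might look like this in that format. The group name is an assumption chosen for illustration. The annotations reference only the com_docker_compose_service label, because the vector returned by absent() carries only the labels taken from equality matchers in the selector, so instance and job would render as empty strings here.

```yaml
groups:
  - name: container-absent            # hypothetical group name
    rules:
      - alert: kibana_absent
        # Returns a single series with value 1 when no matching series exists.
        expr: absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
        for: 5s
        labels:
          severity: page
        annotations:
          summary: "Service {{ $labels.com_docker_compose_service }} down"
          description: "No container_cpu_usage_seconds_total series for service {{ $labels.com_docker_compose_service }} for more than 5 seconds."
```

A file like this would be listed under rule_files in prometheus.yml. Rules are only checked once per evaluation interval, so a "for" value shorter than that interval has little practical effect.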

Answer 2

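I use a small tool called Docker Event Monitor. It runs as a container on the Docker host and sends alerts to Slack, Discord or SparkPost when certain events are triggered. You can configure which events trigger alerts.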

Answer 3

Try this:

 time() - container_last_seen{label="whatever-label-you-have", job="myjob"} > 60

It fires an alert if the container has not been seen for 60 seconds. Alternatively:

absent(container_memory_usage_bytes{label="whatever-label-you-have", job="myjob"})

Note that with the second approach, it may take some time for the container's memory usage to reach 0.
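For illustration, the first query could be wrapped in a Prometheus 2.x rule file roughly as follows. This is a sketch under assumptions: the group name, alert name, severity label and the name="my-container" selector are placeholders standing in for whatever labels your containers actually carry.

```yaml
groups:
  - name: container-last-seen         # hypothetical group name
    rules:
      - alert: ContainerNotSeen       # hypothetical alert name
        # container_last_seen holds a Unix timestamp, so the difference
        # from time() is the number of seconds since cAdvisor last saw it.
        expr: time() - container_last_seen{name="my-container", job="myjob"} > 60
        labels:
          severity: page              # assumed severity label
        annotations:
          summary: "Container {{ $labels.name }} not seen for {{ $value }} seconds"
```

Unlike absent(), this expression keeps the labels of the original series, so label templates such as {{ $labels.name }} resolve to real values.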

Answer 4

We can use either of these two:

absent(container_start_time_seconds{name="my-container"})

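This particular metric, which contains a timestamp, does not seem to linger for the 5-minute staleness window, so it disappears from Prometheus results as soon as it disappears from the last scrape (see: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness), rather than 5 minutes later as some other metrics do. It tested OK, but I am not sure I fully understand staleness...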

Otherwise you can use this:

time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60 OR absent(container_cpu_usage_seconds_total{name="mycontainer"})

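The first part gives the time elapsed since the metric was last scraped. So this works if the metric has disappeared from the exporter output but is still returned by PromQL (for 5 minutes by default). You have to adjust the >60 threshold to your scrape interval.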
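As a sketch of how the combined expression might sit in a Prometheus 2.x rule file: the group name, alert name and severity label below are placeholders, and the 60-second threshold assumes a 15-second scrape interval (about four missed scrapes). With a longer interval the threshold would need to grow accordingly.

```yaml
groups:
  - name: container-scrape-age        # hypothetical group name
    rules:
      - alert: ContainerMetricOldOrAbsent   # hypothetical alert name
        # First branch: the newest sample is older than the threshold.
        # Second branch: no series is returned at all.
        expr: |
          time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60
            or
          absent(container_cpu_usage_seconds_total{name="mycontainer"})
        labels:
          severity: page              # assumed severity label
        annotations:
          summary: "No fresh container_cpu_usage_seconds_total sample for mycontainer"
```

The summary uses fixed text because the two branches return different label sets, so a label template that works for one branch could render empty for the other.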

Answer 5

cadvisor exports the container_last_seen metric, which shows the timestamp when the container was last seen. See these docs. But cadvisor stops exporting the container_last_seen metric a few minutes after the container stops - see this issue for details. So time() - container_last_seen > 60 may miss stopped containers. This can be fixed by wrapping container_last_seen in the last_over_time() function. For example, the following query consistently returns containers that were stopped more than 60 seconds ago but less than 1 hour ago (see the 1h lookbehind window in square brackets):

time() - last_over_time(container_last_seen{container!=""}[1h]) > 60

This query can be simplified further by using the lag function from MetricsQL:

lag(container_last_seen{container!=""}[1h]) > 1m

The container!="" filter is needed to filter out artificial metrics for the cgroups hierarchy - see this answer for details.
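A possible Prometheus rule built on the last_over_time() query is sketched below. It assumes a Prometheus release that includes last_over_time(); the group name, alert name and severity label are placeholders. The lag() variant is MetricsQL and would need a MetricsQL-compatible backend instead.

```yaml
groups:
  - name: container-stopped           # hypothetical group name
    rules:
      - alert: ContainerStopped       # hypothetical alert name
        # last_over_time keeps the most recent sample from the past hour,
        # so the series survives after cadvisor stops exporting it.
        expr: time() - last_over_time(container_last_seen{container!=""}[1h]) > 60
        labels:
          severity: page              # assumed severity label
        annotations:
          summary: "Container {{ $labels.container }} last seen {{ $value }} seconds ago"
```

Because of the 1h lookbehind window, the alert stops firing on its own about an hour after the container was last seen, even if the container never comes back. Widen the window if the alert should stay active longer.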

prometheus

cadvisor