How to run a pod based on a Prometheus alert

Is there any way we can run a pod based on an alert fired from Prometheus? We have a scenario where we need to execute a pod once a disk-pressure threshold is crossed. I was able to create the alert, but I need to execute a pod. How can I achieve this?

groups:
  - name: node_memory_MemAvailable_percent
    rules:
    - alert: node_memory_MemAvailable_percent_alert
      annotations:
        description: Memory on node {{ $labels.instance }} is under pressure (currently at {{ $value }}%)
        summary: Memory usage is under pressure, system may become unstable.
      expr: |
        100 - ((node_memory_MemAvailable_bytes{job="node-exporter"} * 100) / node_memory_MemTotal_bytes{job="node-exporter"}) > 80
      for: 2m
      labels:
        severity: warning

I think Alertmanager can help you here, using a webhook receiver (documentation).

That way, when an alert fires, Prometheus sends it to Alertmanager, which then performs a POST to your custom webhook.
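
For reference, the Alertmanager side could look roughly like this (a minimal sketch; the receiver name, matcher, and service URL are assumptions, not part of your setup):

# alertmanager.yml (sketch) - route the memory alert to a custom webhook
route:
  receiver: default
  routes:
    - receiver: pod-runner
      match:
        alertname: node_memory_MemAvailable_percent_alert
receivers:
  - name: default
  - name: pod-runner
    webhook_configs:
      # your custom service that will react to the alert
      - url: http://alert-handler.monitoring.svc:8080/alert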

Of course, you need to implement a service that handles the alert and runs your action.
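
Here is a minimal sketch of such a service, assuming Flask and the official kubernetes Python client; the endpoint path, image, and namespace are illustrative, not from your setup:

from flask import Flask, request
from kubernetes import client, config

app = Flask(__name__)
config.load_incluster_config()  # assumes the service runs inside the cluster

@app.route("/alert", methods=["POST"])
def handle_alert():
    payload = request.get_json()
    # Alertmanager POSTs a JSON document containing a list of alerts;
    # react only to the ones that are currently firing
    for alert in payload.get("alerts", []):
        if alert.get("status") == "firing":
            run_job(alert["labels"].get("instance", "unknown"))
    return "", 200

def run_job(instance):
    # launch a one-off pod (wrapped in a Job) in response to the alert
    container = client.V1Container(
        name="alert-reaction",
        image="busybox",  # replace with your actual image
        command=["sh", "-c", f"echo reacting to alert on {instance}"],
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="alert-job-"),
        spec=client.V1JobSpec(template=template),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)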

As a general note, your question talks about disk pressure, but in the code I can see available memory. If you want to scale your replicas up and down based on memory, you can implement a Horizontal Pod Autoscaler:

The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag (with a default value of 15 seconds).

During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).

You can create your own HPA based on memory utilization. Here is an example:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-memory-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        # a target of type Utilization takes averageUtilization (a percentage);
        # to target an absolute quantity like 10Mi, use AverageValue
        type: AverageValue
        averageValue: 10Mi

You can also create a Kubernetes HPA with custom metrics from Prometheus:

Autoscaling is an approach to automatically scale up or down workloads based on the resource usage. The K8s Horizontal Pod Autoscaler:

  • is implemented as a control loop that periodically queries the Resource Metrics API for core metrics, through metrics.k8s.io API, like CPU/memory and the Custom Metrics API for application-specific metrics (external.metrics.k8s.io or custom.metrics.k8s.io API. They are provided by “adapter” API servers offered by metrics solution vendors. There are some known solutions, but none of those implementations are officially part of Kubernetes)
  • automatically scales the number of pods in a deployment or replica set based on the observed metrics.

In what follows we’ll focus on the custom metrics because the Custom Metrics API made it possible for monitoring systems like Prometheus to expose application-specific metrics to the HPA controller.
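
For instance, with prometheus-adapter (or a similar adapter) serving a Prometheus metric through the Custom Metrics API, an HPA could target it like this (a sketch; the metric name http_requests_per_second and the Deployment name are assumptions):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-custom-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # served by the custom.metrics.k8s.io API, e.g. via prometheus-adapter
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"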

Another solution could be to use KEDA; have a look at this guide. Here is an example YAML that scales an nginx Deployment based on the number of waiting connections (with a threshold of 500):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx-scale
  namespace: keda-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: nginx-server
  minReplicaCount: 1
  maxReplicaCount: 5
  cooldownPeriod: 30
  pollingInterval: 1
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://prometheus_server/prometheus
      metricName: nginx_connections_waiting_keda
      query: |
        sum(nginx_connections_waiting{job="nginx"})
      threshold: "500"

Yes, we have the webhook, but we implemented the service using am-executor as a custom service: from the am-executor custom script we run the required job from an ADO (Azure DevOps) pipeline.
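
For readers who haven't used it: prometheus-am-executor is a small daemon that receives Alertmanager webhook notifications and runs a command in response. Its configuration looks roughly like this (a sketch based on the project's README; the command and labels are illustrative):

listen_address: ":8080"
verbose: false
commands:
  # run this script whenever an alert with matching labels arrives
  - cmd: /scripts/run-cleanup-job.sh
    match_labels:
      "severity": "warning"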

You can do this with an open-source project called Robusta. (Disclaimer: I'm a maintainer.)

First, define which Prometheus alert should be the trigger:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: DiskSpaceAlertName
  actions:
  - disk_watcher: {}

Second, we need to write the actual action that runs when the alert fires. (Called disk_watcher above.) You can skip this step if someone has already written an action matching your needs, as there are already more than 50 built-in actions.

In this case there is no built-in action, so we need to write one in Python. (I'd be happy to add a built-in one for this, though :)

from robusta.api import *  # provides @action, DeploymentEvent, RobustaPod, MarkdownBlock, FileBlock

@action
def disk_watcher(event: DeploymentEvent):
    deployment = event.get_deployment()

    # read / modify the resources here
    print(deployment.spec.template.spec.containers[0].resources)
    # here you would do the actual update to the resources you like
    ...
    # afterwards, save the change
    deployment.update()

    # fetch the relevant pod
    pod = RobustaPod.find_pod(deployment.metadata.name, deployment.metadata.namespace)

    # see what is using up disk space
    output = pod.exec("df -h")

    # create another pod
    other_output = RobustaPod.exec_in_debugger_pod("my-new-pod", pod.spec.nodeName, "cmd to run", "my-image")

    # send details to slack or any other destination
    event.add_enrichment([
        MarkdownBlock("the output from df is attached"),
        FileBlock("df.txt", output.encode()),
        FileBlock("other.txt", other_output.encode())
    ])