Changing Prometheus job label in scraper for cAdvisor breaks Grafana dashboards
I've installed Prometheus on my Kubernetes cluster with Helm, using the community chart kube-prometheus-stack - and I get some beautiful dashboards in the bundled Grafana instance. I now wanted the recommender from the Vertical Pod Autoscaler to use Prometheus as a data source for historic metrics, as described here. Meaning, I had to make a change to the Prometheus scraper settings for cAdvisor, and that pointed me in the right direction, since after making the change I can now see the correct job label on the metrics coming from cAdvisor.
Unfortunately, some of the charts in the Grafana dashboards are now broken. It looks like they no longer pick up the CPU metrics - the CPU-related charts just show "No data".
So I guess I have to tweak the charts to pick up the metrics correctly again, but I don't see any obvious place to do that in Grafana?
Not sure if it's relevant to the problem, but I'm running my Kubernetes cluster on Azure Kubernetes Service (AKS).
Here is the full values.yaml I supplied to the Helm chart when installing Prometheus:
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubelet:
  serviceMonitor:
    # Disables the normal cAdvisor scraping, as we add it with the job name "kubernetes-cadvisor" under additionalScrapeConfigs
    # The reason for doing this is to enable the VPA to use the metrics for the recommender
    # https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md#how-can-i-use-prometheus-as-a-history-provider-for-the-vpa-recommender
    cAdvisor: false
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          # the azurefile storage class is created automatically on AKS
          storageClassName: azurefile
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
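For completeness, I apply these values with a standard Helm install/upgrade along these lines (the release name and namespace here are just placeholders):

```bash
# Add the community repo and install/upgrade kube-prometheus-stack with the values above.
# "prom-stack" and "monitoring" are placeholder names.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values values.yaml
```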
Kubernetes version: 1.21.2
kube-prometheus-stack version: 18.1.1
helm version: version.BuildInfo{Version:"v3.6.3", GitCommit:"d506314abfb5d21419df8c7e7e68012379db2354", GitTreeState:"dirty", GoVersion:"go1.16.5"}
Unfortunately, I don't have access to Azure AKS, so I reproduced this issue on my GKE cluster. Below I'll provide some explanation that may help you resolve your problem.
First, you can try executing the node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate rule to see whether it returns any results.
If it doesn't return any records, read the following paragraphs.
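If you prefer the command line over the Prometheus UI, one quick way to run that query is against the HTTP API through a port-forward. This is only a sketch: the service name below comes from my "prom-1" release, so adjust the name and namespace to your installation, and it assumes jq is available.

```bash
# Port-forward the Prometheus service and query the recording rule via the HTTP API.
kubectl port-forward svc/prom-1-kube-prometheus-sta-prometheus 9090:9090 &

# Count the series the rule currently produces; 0 means it returns no records.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate' \
  | jq '.data.result | length'
```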
Creating a scrape configuration for cAdvisor
Instead of creating a completely new scrape configuration for cAdvisor, I suggest using the one that is generated by default when kubelet.serviceMonitor.cAdvisor: true is set, with a few modifications such as changing the label to job=kubernetes-cadvisor.
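If you want to see what that default-generated configuration looks like in your own cluster, one way (again a sketch: the Secret name assumes a Prometheus object called prom-1-kube-prometheus-sta-prometheus, and that your prometheus-operator version stores the rendered config compressed) is to read it out of the Secret the operator creates:

```bash
# Dump the scrape config the operator rendered from the ServiceMonitors
# and show the section around the cAdvisor metrics path.
kubectl get secret prometheus-prom-1-kube-prometheus-sta-prometheus \
  -o jsonpath='{.data.prometheus\.yaml\.gz}' \
  | base64 -d | gunzip \
  | grep -B 5 -A 60 'metrics_path: /metrics/cadvisor'
```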
In my example, the 'kubernetes-cadvisor' scrape configuration looks like this:
NOTE: I added this configuration under additionalScrapeConfigs in the values.yaml file (the rest of the values.yaml file can look like yours).
- job_name: 'kubernetes-cadvisor'
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics/cadvisor
  scheme: https
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  follow_redirects: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
    separator: ;
    regex: kubelet
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kubelet
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https-metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_container_name]
    separator: ;
    regex: (.*)
    target_label: container
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https-metrics
    action: replace
  - source_labels: [__metrics_path__]
    separator: ;
    regex: (.*)
    target_label: metrics_path
    replacement: $1
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    namespaces:
      names:
      - kube-system
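Once this scrape config is in place, it is worth confirming that the new job is actually being scraped before touching any rules. Reusing the port-forward from earlier (again just a sketch, names are from my setup):

```bash
# The kubelet/cAdvisor targets should report up=1 under the new job name...
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="kubernetes-cadvisor"}' \
  | jq '.data.result[] | {instance: .metric.instance, up: .value[1]}'

# ...and the raw cAdvisor series should now carry job="kubernetes-cadvisor".
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(container_cpu_usage_seconds_total{job="kubernetes-cadvisor"})' \
  | jq '.data.result'
```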
Modifying the Prometheus rules
By default, the Prometheus rules use job="kubelet" in their PromQL expressions to fetch data from cAdvisor.
After changing job=kubelet to job=kubernetes-cadvisor, we also need to change this label in the Prometheus rules:
NOTE: We only need to modify the rules that contain metrics_path="/metrics/cadvisor" (these are the rules that retrieve data from cAdvisor).
$ kubectl get prometheusrules prom-1-kube-prometheus-sta-k8s.rules -o yaml
...
  - name: k8s.rules
    rules:
    - expr: |-
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubernetes-cadvisor", metrics_path="/metrics/cadvisor", image!=""}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
...
here we have a few more rules to modify...
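Instead of editing every expression by hand, you could do the replacement in bulk. This is a rough sketch, not an official procedure: review the output before applying it, and keep in mind that the chart may restore its managed rules on the next helm upgrade.

```bash
# Replace job="kubelet" with job="kubernetes-cadvisor" in the cAdvisor-based
# expressions of the k8s.rules PrometheusRule and re-apply it.
kubectl get prometheusrules prom-1-kube-prometheus-sta-k8s.rules -o yaml \
  | sed 's|job="kubelet", metrics_path="/metrics/cadvisor"|job="kubernetes-cadvisor", metrics_path="/metrics/cadvisor"|g' \
  | kubectl apply -f -
```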
After modifying the Prometheus rules, wait a while and check whether it has the desired effect. We can try executing node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate again, as we did at the beginning.
Additionally, check Grafana to make sure it has started displaying the dashboards correctly again.
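And to close the loop on the original goal: with the metrics now exposed under job="kubernetes-cadvisor", the VPA recommender can be pointed at Prometheus as its history provider. Roughly like this (the flags come from the VPA FAQ linked in the question; the binary path, address, and history length are placeholder values for a typical in-cluster setup, so check the FAQ for the full list of options):

```bash
# Illustrative only: in practice these flags go into the args of the
# vpa-recommender Deployment rather than being run by hand.
/recommender \
  --storage=prometheus \
  --prometheus-address=http://prom-1-kube-prometheus-sta-prometheus.monitoring.svc:9090 \
  --prometheus-cadvisor-job-name=kubernetes-cadvisor \
  --history-length=8d
```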