prometheus 中未显示节点导出器目标 UI

node-exporter targets not showing in prometheus UI

我使用 kubeadm 设置了一个 Kubernetes 集群。 我在上面安装了 prometheus 和 node-exporter,基于:

pods 似乎 运行 正确:

 kubectl get pods --namespace=monitoring -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
node-exporter-jk2sd                      1/1     Running   0          90m   192.168.5.20   work03   <none>           <none>
node-exporter-jldrx                      1/1     Running   0          90m   192.168.5.17   work04   <none>           <none>
node-exporter-mgtld                      1/1     Running   0          90m   192.168.5.15   work01   <none>           <none>
node-exporter-tq7bx                      1/1     Running   0          90m   192.168.5.41   work02   <none>           <none>
prometheus-deployment-5d79b5f65b-tkpd2   1/1     Running   0          91m   192.168.5.40   work02   <none>           <none>

我也可以看到端点:

kubectl get endpoints -n monitoring
NAME            ENDPOINTS                                                           AGE
node-exporter   192.168.5.15:9100,192.168.5.17:9100,192.168.5.20:9100 + 1 more...   5m3s

我也这样做了:kubectl port-forward prometheus-deployment-5d79b5f65b-tkpd2 8080:9090 -n monitoring 当我访问 prometheus web UI > Status > Targets 时,我在那里找不到节点导出器。当我开始为 node-exporter 报告的指标键入查询时,它不会自动显示在查询编辑器中。

来自 prometheus pod 的日志似乎有很多错误:

kubectl logs prometheus-deployment-5d79b5f65b-tkpd2 -n monitoring
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:428 msg="Starting Prometheus" version="(version=2.29.1, branch=HEAD, revision=dcb07e8eac34b5ea37cd229545000b857f1c1637)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:433 build_context="(go=go1.16.7, user=root@364730518a4e, date=20210811-14:48:27)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:434 host_details="(Linux 5.4.0-70-generic #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021 x86_64 prometheus-deployment-5d79b5f65b-tkpd2 (none))"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:435 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:436 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-08-11T16:24:21.745Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-08-11T16:24:21.745Z caller=main.go:812 msg="Starting TSDB ..."
level=info ts=2021-08-11T16:24:21.748Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:815 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:829 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.15µs
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:835 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:892 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:898 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=75.316µs wal_replay_duration=451.769µs total_replay_duration=566.051µs
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:839 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:842 msg="TSDB started"
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:969 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-08-11T16:24:21.757Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.759Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.762Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:1006 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=7.940972ms db_storage=607ns remote_storage=1.251µs web_handler=283ns query_engine=694ns scrape=227.668µs scrape_sd=6.081132ms notify=27.11µs notify_sd=16.477µs rules=648.58µs
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:784 msg="Server is ready to receive web requests."
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:22.587Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:22.855Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.153Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.261Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.335Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:54.814Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.282Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.516Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.934Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:56.442Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.058Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.204Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.246Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.879Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:31.479Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:09.673Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:09.835Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:10.467Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:11.170Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:12.684Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:55.324Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:01.550Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:01.621Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:04.801Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:05.598Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:57.256Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:29:04.688Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"

有没有办法解决这个问题并让节点导出器出现在目标中?

版本详情:

kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}

编辑: 集群设置如下:

sudo kubeadm reset
sudo rm $HOME/.kube/config
sudo kubeadm init --pod-network-cidr=192.168.5.0/24
mkdir -p $HOME/.kube; sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config; sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

正在使用法兰绒。

法兰绒pods是运行:

kube-flannel-ds-45qwf                1/1     Running   0          31h   x.x.x.41   work01   <none>           <none>
kube-flannel-ds-4rwzj                1/1     Running   0          31h   x.x.x.40   mast01   <none>           <none>
kube-flannel-ds-8fdtt                1/1     Running   24         31h   x.x.x.43   work03   <none>           <none>
kube-flannel-ds-8hl5f                1/1     Running   23         31h   x.x.x.44   work04   <none>           <none>
kube-flannel-ds-xqtrd                1/1     Running   0          31h   x.x.x.42   work02   <none>           <none>

此问题与 SDN 无法正常工作有关。

作为一般规则,解决这个问题,我们会检查 SDN pods(印花布、编织,或者在这种情况下是法兰绒),它们是否健康,日志中的任何错误,...

检查 iptables (iptables -nL) 和 ipvs (ipvsadm -l n) 配置节点。

重新启动 SDN pods,以及 kube-proxy,如果您仍然没有找到任何东西。

现在,在这个特定的案例中,我们没有遭受中断的困扰:集群是新部署的,很可能 SDN 根本就没有工作过——尽管这可能并不明显,但对于 kubeadm 部署来说,这并没有”不附带其他 pods 默认值,其中大部分使用主机网络。

kubeadm init命令提到pod CIDR是某个192.168.5.0/24,这带来了两个备注:

  • 对于所有 SDN:pod CIDR 是一个子网,将被拆分成更小的子网(通常是 /24 或 /25)。每个范围在第一次加入集群时静态分配给节点

  • 运行 flannel SDN:kubeadm init 应该包含一个 --pod-network-cidr 参数,该参数必须与 kube-flannel-cfg ConfigMap 中配置的子网相匹配,参见 net-conf.json键。

虽然我不熟悉解决这个问题的过程,但 ServerFault 上似乎有一个答案给出了一些说明,听起来不错:https://serverfault.com/a/977401/293779