k3s - Metrics server doesn't work for worker nodes
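I deployed a k3s cluster on two Raspberry Pi 4s, one as the master and the other as a worker, using the install script that k3s provides with the following options:
For the master node (192.168.1.113 is the master node's IP):
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='server --bind-address 192.168.1.113' sh -
For the agent node: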
curl -sfL https://get.k3s.io | \
K3S_URL=https://192.168.1.113:6443 \
K3S_TOKEN=<master-token> \
INSTALL_K3S_EXEC='agent' sh -
Everything seems to work, but kubectl top nodes returns the following:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k3s-master 137m 3% 1285Mi 33%
k3s-node-01 <unknown> <unknown> <unknown> <unknown>
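I also tried to deploy the k8s dashboard by following the docs, but it fails because it cannot reach the metrics server and gets a timeout error:
"error trying to reach service: dial tcp 10.42.1.11:8443: i/o timeout"
I see a lot of errors in the pod logs: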
2021/09/17 09:24:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:25:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:26:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:27:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
Logs from the metrics-server pod:
elet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
E0917 14:03:24.767949 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
E0917 14:04:24.767960 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
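Moving this out of the comments for better visibility.
After creating a small cluster, I could not reproduce this behaviour: metrics-server ran fine on both nodes, and kubectl top nodes showed information and metrics for both available nodes (though it took some time before metrics started being collected).
That leads to troubleshooting why it doesn't work. Checking the metrics-server logs is the most efficient way to figure this out: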
$ kubectl logs metrics-server-58b44df574-2n9dn -n kube-system
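Because the same error repeats every minute, it can help to reduce the log to its distinct connection errors. The following is only a minimal sketch: `distinct_dial_errors` is a hypothetical helper name, it assumes the failures contain a `dial tcp ...` segment like the log lines in the question, and the usage comment assumes the deployment is named metrics-server in kube-system.

```shell
#!/usr/bin/env bash

# distinct_dial_errors: read log text on stdin and print each distinct
# "dial tcp ..." error once. Lines without such a segment are ignored.
distinct_dial_errors() {
  grep -o 'dial tcp .*' | sort -u
}

# Usage (assumes the deployment name metrics-server; adjust if yours differs):
#   kubectl logs -n kube-system deployment/metrics-server | distinct_dial_errors
```

Addressing the deployment instead of a pod avoids copying the generated pod name, which changes whenever the pod is recreated.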
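Depending on the logs, the next steps will differ. For example, from the comments above:
- first it was no route to host, which is network-related (the node could not be reached by its hostname)
- then i/o timeout, which means a route exists but the service did not respond. This can happen when a firewall blocks certain ports or sources, when kubelet is not running (listening on port 10250), or, as turned out to be the case for the OP, when there is an issue with ntp, which affected certificates and connections.
- errors may be different in other cases; it is important to find the error and troubleshoot further based on it.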
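The list above can be sketched as a small triage helper. This is an illustration under stated assumptions, not a definitive procedure: `next_step` and `port_open` are hypothetical names, the hints only restate the cases listed above, and `port_open` relies on bash's `/dev/tcp` feature and the coreutils `timeout` command being available.

```shell
#!/usr/bin/env bash

# next_step: map one metrics-server log line to the check suggested above.
next_step() {
  case "$1" in
    *"no route to host"*)
      echo "network: check name resolution and routing to the node" ;;
    *"i/o timeout"*)
      echo "service: check firewall, kubelet on port 10250, and ntp" ;;
    *)
      echo "other: troubleshoot based on the specific error" ;;
  esac
}

# port_open: succeed if a TCP connection to host:port can be opened
# within 3 seconds (bash-only, uses /dev/tcp).
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage, run from the master node with the worker address from the logs:
#   port_open 192.168.1.106 10250 && echo "kubelet port reachable"
```

A successful connection only shows that something is listening on the port; certificate or clock problems would still need to be checked separately.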