Prometheus pod unable to call apiserver endpoints
I am trying to set up a monitoring stack (prometheus + alertmanager + node_exporter etc.) onto my raspberry pi k8s cluster (1 master + 3 worker nodes) via helm install stable/prometheus.
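For reference, the install was roughly along these lines; the release name pi-monitoring is inferred from the pod names below, and the exact syntax depends on your Helm version (Helm 2 takes --name instead of a positional release name):

helm install pi-monitoring stable/prometheus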
I managed to get all the required pods running:
pi-monitoring-prometheus-alertmanager-767cd8bc65-89hxt 2/2 Running 0 131m 10.17.2.56 kube2 <none> <none>
pi-monitoring-prometheus-node-exporter-h86gt 1/1 Running 0 131m 192.168.1.212 kube2 <none> <none>
pi-monitoring-prometheus-node-exporter-kg957 1/1 Running 0 131m 192.168.1.211 kube1 <none> <none>
pi-monitoring-prometheus-node-exporter-x9wgb 1/1 Running 0 131m 192.168.1.213 kube3 <none> <none>
pi-monitoring-prometheus-pushgateway-799d4ff9d6-rdpkf 1/1 Running 0 131m 10.17.3.36 kube1 <none> <none>
pi-monitoring-prometheus-server-5d989754b6-gp69j 2/2 Running 0 98m 10.17.1.60 kube3 <none> <none>
However, after port-forwarding the prometheus server's port 9090 and navigating to the Targets page, I found that none of the node_exporters are registered.
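The port-forward was done roughly like this (a sketch; the deployment name is taken from the pod listing above and may differ in your setup):

kubectl port-forward deploy/pi-monitoring-prometheus-server 9090
# then open http://localhost:9090/targets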
Digging through the logs, I found this:
level=error ts=2020-04-12T05:15:05.083Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:333: Failed to list *v1.Node: Get https://10.18.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:299: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: Get https://10.18.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.085Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:262: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
Question: why is the prometheus pod unable to call the apiserver endpoints? I am not too sure where the configuration went wrong.
Following the debugging guide, I found that individual nodes are unable to resolve services on other nodes.
I have spent the past day reading various sources to troubleshoot this, but honestly I am not even sure where to begin.
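The kind of check involved is sketched below (not the exact command used; busybox:1.28 is chosen because its nslookup applet behaves, and kubernetes.default is just the built-in service):

kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default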
These are the pods running in the kube-system namespace. Hopefully this gives a better picture of how my system is set up.
pi@kube4:~ $ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-66bff467f8-nzvq8 1/1 Running 0 13d 10.17.0.2 kube4 <none> <none>
coredns-66bff467f8-z7wdb 1/1 Running 0 13d 10.17.0.3 kube4 <none> <none>
etcd-kube4 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-apiserver-kube4 1/1 Running 2 13d 192.168.1.214 kube4 <none> <none>
kube-controller-manager-kube4 1/1 Running 2 13d 192.168.1.214 kube4 <none> <none>
kube-flannel-ds-arm-8g9fb 1/1 Running 1 13d 192.168.1.212 kube2 <none> <none>
kube-flannel-ds-arm-c5qt9 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-flannel-ds-arm-q5pln 1/1 Running 1 13d 192.168.1.211 kube1 <none> <none>
kube-flannel-ds-arm-tkmn6 1/1 Running 1 13d 192.168.1.213 kube3 <none> <none>
kube-proxy-4zjjh 1/1 Running 0 13d 192.168.1.213 kube3 <none> <none>
kube-proxy-6mk2z 1/1 Running 0 13d 192.168.1.211 kube1 <none> <none>
kube-proxy-bbr8v 1/1 Running 0 13d 192.168.1.212 kube2 <none> <none>
kube-proxy-wfsbm 1/1 Running 0 13d 192.168.1.214 kube4 <none> <none>
kube-scheduler-kube4 1/1 Running 3 13d 192.168.1.214 kube4 <none> <none>
I suspect there is a networking issue that prevents you from reaching the API server. "dial tcp 10.18.0.1:443: i/o timeout" generally means you are unable to connect to or read from the server. You can use the steps below to find out where the problem is:
1. Deploy a busybox pod with kubectl run busybox --image=busybox -n kube-system
2. Get into the pod with kubectl exec -n kube-system -it <podname> sh
3. From that tty, try something like telnet 10.18.0.1 443 to find out the connection issue
Let me know the output (a consolidated version of these steps is sketched below).
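A consolidated sketch of those steps (the sleep keeps the busybox container alive, since the default busybox entrypoint exits immediately; busybox ships a telnet applet):

kubectl run busybox -n kube-system --image=busybox --restart=Never -- sleep 3600
kubectl exec -n kube-system -it busybox -- sh
# inside the pod:
telnet 10.18.0.1 443
# a connect error or timeout here points at pod-to-service routing rather than at Prometheus itself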
NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16 to kubeadm init to ensure that the podCIDR is set.
This is because the flannel ConfigMap is configured by default to work with "Network": "10.244.0.0/16".
You have configured your kubeadm with --pod-network-cidr=10.17.0.0/16, so this now needs to be reflected in the flannel ConfigMap kube-flannel-cfg as below:
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.17.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
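One way to roll this out (a sketch, assuming the standard flannel manifest labels; the file name kube-flannel-cfg.yaml is just an example) is to apply the updated ConfigMap and then restart the flannel pods so they regenerate their net-conf:

kubectl -n kube-system apply -f kube-flannel-cfg.yaml
kubectl -n kube-system delete pod -l app=flannel

Pods that already received addresses under the old configuration may also need to be recreated.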
Thanks to @kitt for the debugging help.
After much troubleshooting, I realised that I was not able to ping pods on other nodes, only pods within the same node. The problem turned out to be related to the iptables configuration described here: https://github.com/coreos/flannel/issues/699.
tl;dr: running iptables --policy FORWARD ACCEPT fixed my problem.
Before updating the iptables policy:
Chain FORWARD (policy DROP)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
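For completeness, the fix plus a quick verification; persisting it is optional, and iptables-persistent is just one common choice on Debian-based systems, not something from the original setup:

sudo iptables --policy FORWARD ACCEPT
sudo iptables -L FORWARD    # should now report "Chain FORWARD (policy ACCEPT)"
sudo apt-get install iptables-persistent && sudo netfilter-persistent save    # optional: keep it across reboots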
The problem is now resolved. Thanks again to @kitt for the earlier help!