Prometheus pod unable to call apiserver endpoints

I am trying to set up a monitoring stack (prometheus + alertmanager + node_exporter, etc.) on my Raspberry Pi k8s cluster (1 master + 3 worker nodes) via helm install stable/prometheus.

I managed to get all the required pods running.

pi-monitoring-prometheus-alertmanager-767cd8bc65-89hxt   2/2     Running            0          131m    10.17.2.56      kube2   <none>           <none>
pi-monitoring-prometheus-node-exporter-h86gt             1/1     Running            0          131m    192.168.1.212   kube2   <none>           <none>
pi-monitoring-prometheus-node-exporter-kg957             1/1     Running            0          131m    192.168.1.211   kube1   <none>           <none>
pi-monitoring-prometheus-node-exporter-x9wgb             1/1     Running            0          131m    192.168.1.213   kube3   <none>           <none>
pi-monitoring-prometheus-pushgateway-799d4ff9d6-rdpkf    1/1     Running            0          131m    10.17.3.36      kube1   <none>           <none>
pi-monitoring-prometheus-server-5d989754b6-gp69j         2/2     Running            0          98m     10.17.1.60      kube3   <none>           <none>

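However, after port-forwarding the prometheus server port 9090 and navigating to the Targets page, I found that none of the node_exporters are registered.

Digging through the logs, I found this: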

level=error ts=2020-04-12T05:15:05.083Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:333: Failed to list *v1.Node: Get https://10.18.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:299: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.084Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: Get https://10.18.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"
level=error ts=2020-04-12T05:15:05.085Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:262: Failed to list *v1.Service: Get https://10.18.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.18.0.1:443: i/o timeout"

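Question: why is the prometheus pod unable to call the apiserver endpoints? I am not sure what is misconfigured.

Update: following the debug guide, I found that individual nodes are unable to resolve services on other nodes.

I have spent the past day reading various resources to troubleshoot this, but honestly I am not even sure where to begin.

These are the pods running in the kube-system namespace. I hope this gives a better idea of how my system is set up.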

pi@kube4:~ $ kubectl get pods -n kube-system -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
coredns-66bff467f8-nzvq8        1/1     Running   0          13d   10.17.0.2       kube4   <none>           <none>
coredns-66bff467f8-z7wdb        1/1     Running   0          13d   10.17.0.3       kube4   <none>           <none>
etcd-kube4                      1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-apiserver-kube4            1/1     Running   2          13d   192.168.1.214   kube4   <none>           <none>
kube-controller-manager-kube4   1/1     Running   2          13d   192.168.1.214   kube4   <none>           <none>
kube-flannel-ds-arm-8g9fb       1/1     Running   1          13d   192.168.1.212   kube2   <none>           <none>
kube-flannel-ds-arm-c5qt9       1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-flannel-ds-arm-q5pln       1/1     Running   1          13d   192.168.1.211   kube1   <none>           <none>
kube-flannel-ds-arm-tkmn6       1/1     Running   1          13d   192.168.1.213   kube3   <none>           <none>
kube-proxy-4zjjh                1/1     Running   0          13d   192.168.1.213   kube3   <none>           <none>
kube-proxy-6mk2z                1/1     Running   0          13d   192.168.1.211   kube1   <none>           <none>
kube-proxy-bbr8v                1/1     Running   0          13d   192.168.1.212   kube2   <none>           <none>
kube-proxy-wfsbm                1/1     Running   0          13d   192.168.1.214   kube4   <none>           <none>
kube-scheduler-kube4            1/1     Running   3          13d   192.168.1.214   kube4   <none>           <none>

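I suspect a networking issue is preventing you from reaching the API server. "dial tcp 10.18.0.1:443: i/o timeout" generally means that you cannot connect to, or read from, the server. You can use the following steps to narrow down the problem:

1. Deploy a busybox pod using kubectl run busybox --image=busybox -n kube-system
2. Get into the pod using kubectl exec -n kube-system -it <podname> sh
3. From that tty, try something like telnet 10.18.0.1 443 to check connectivity

Let me know the output.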
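The steps above can be sketched as a small shell function. This is only an illustration, not a definitive procedure: the function name debug_shell, the pod name busybox and the one-hour sleep are my own choices. The sleep is there on the assumption that a busybox container started without a long-running command exits right away, which would leave nothing to exec into.

```shell
# Sketch only: start a long-lived busybox pod in kube-system and open a shell in it.
# debug_shell, the pod name "busybox" and the sleep duration are illustrative.
debug_shell() {
  # Keep the container alive so that there is something to exec into.
  kubectl run busybox --image=busybox --restart=Never -n kube-system -- sleep 3600
  # Wait until the pod reports Ready before opening a shell.
  kubectl wait --for=condition=Ready pod/busybox -n kube-system --timeout=120s
  # Inside this shell, try: telnet 10.18.0.1 443
  kubectl exec -n kube-system -it busybox -- sh
}
```

Call debug_shell from a machine with kubectl access to the cluster, and remove the pod afterwards with kubectl delete pod busybox -n kube-system.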

The Flannel documentation states:

NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16 to kubeadm init to ensure that the podCIDR is set.

This is because the flannel ConfigMap is configured by default to work on "Network": "10.244.0.0/16".

You configured kubeadm with --pod-network-cidr=10.17.0.0/16, so the same value now needs to be set in the flannel ConfigMap kube-flannel-cfg, like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.17.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
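To check whether the value in the ConfigMap agrees with the CIDR given to kubeadm, something like the following could be used. It is a minimal sketch: flannel_network and cidr_matches are hypothetical helper names, and the sed pattern assumes the "Network" key and its value sit on one line, as in the ConfigMap above.

```shell
# Hypothetical helper: read a flannel net-conf.json on stdin and print its "Network" value.
flannel_network() {
  sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: succeed only if the "Network" value read on stdin equals $1.
cidr_matches() {
  [ "$(flannel_network)" = "$1" ]
}

# Against a live cluster (not executed here):
#   kubectl -n kube-system get configmap kube-flannel-cfg \
#     -o jsonpath='{.data.net-conf\.json}' | cidr_matches 10.17.0.0/16 && echo match
```

After editing the ConfigMap, the flannel pods probably need to be recreated so that they read the new value, for example with kubectl -n kube-system delete pod -l app=flannel (assuming the pods carry the same app: flannel label as the ConfigMap above).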

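Thanks to @kitt for the debugging help.

After a lot of troubleshooting, I realized that I could not ping pods running on other nodes; pods were reachable only from the node they were running on. The problem appears to be related to the iptables configuration described at https://github.com/coreos/flannel/issues/699.

tl;dr: running iptables --policy FORWARD ACCEPT solved my problem. Before updating the iptables policy: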

Chain FORWARD (policy DROP)
target     prot opt source               destination
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forwarding rules */
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
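For anyone wanting to check the same thing on their own nodes, here is a rough sketch. forward_policy is a hypothetical helper name; it only pulls the policy out of the first line of a listing like the one above.

```shell
# Hypothetical helper: read `iptables -L FORWARD` output on stdin and print the chain policy.
forward_policy() {
  sed -n 's/^Chain FORWARD (policy \([A-Z]*\)).*/\1/p'
}

# On each node, as root (not executed here):
#   iptables -L FORWARD -n | forward_policy   # DROP in the listing above
#   iptables --policy FORWARD ACCEPT
#   iptables -L FORWARD -n | forward_policy   # should now report ACCEPT
```

Note that a policy set this way lives only in the running kernel, so it will likely have to be reapplied (or saved with whatever persistence mechanism the distribution provides) after a reboot.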

The problem is now resolved. Thanks to @kitt for the earlier help!