Connection refused between kube-proxy and nginx backend

We regularly see connection refused errors on a custom NGINX reverse proxy installed in AWS EKS (see the Kubernetes templates below).

Initially we thought this was a load balancer issue but, after further investigation, there appears to be a problem between kube-proxy and the nginx pods.

When I run repeated wget IP:PORT requests against only a node's internal IP and the desired service NodePort, we see a number of bad requests and eventually failed: Connection refused.

Whereas when I run the same requests against only the pod IP and port, I cannot reproduce this connection refused.
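
For reference, the pod IPs and ports used for the direct pod-level tests can be listed with (using the app label from the templates below):

    $ kubectl get pods -l app=nginx-lua-ssl -o wide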

Example wget output

Failing:

wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31--  http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... failed: Connection refused.

Succeeding:

wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31--  http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... connected.
HTTP request sent, awaiting response... 400 Bad Request
2020-06-26 01:15:31 ERROR 400: Bad Request.

In the logs of the NGINX service we do not see the refused requests at all, while we do see the other BAD REQUEST entries.
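
The nginx logs can be tailed across all of the deployment's pods with, for example:

    $ kubectl logs -l app=nginx-lua-ssl --tail=100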

I have read through several issues about kube-proxy and would be interested in any other insight into improving this situation.

For example: https://github.com/kubernetes/kubernetes/issues/38456

Any help much appreciated.

Kubernetes templates

##
# Main nginx deployment. The docker image tag may need to be
# updated.
##
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-lua-ssl-deployment
  labels:
    service: https-custom-domains
spec:
  selector:
    matchLabels:
      app: nginx-lua-ssl
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx-lua-ssl
        service: https-custom-domains
    spec:
      containers:
      - name: nginx-lua-ssl
        image: "0000000000.dkr.ecr.ap-southeast-2.amazonaws.com/lua-resty-auto-ssl:v0.NN"
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        - containerPort: 8443
        - containerPort: 8999
        envFrom:
         - configMapRef:
            name: https-custom-domain-conf

##
# Load balancer which manages traffic into the nginx instance
# In aws, this uses an ELB (elastic load balancer) construct
##
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  name: nginx-lua-load-balancer
  labels:
    service: https-custom-domains
spec:
  ports:
  - name: http
    port: 80
    targetPort: 8080
  - name: https
    port: 443
    targetPort: 8443
  externalTrafficPolicy: Local
  selector:
    app: nginx-lua-ssl
  type: LoadBalancer
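
Note the Service does not pin nodePort values, so Kubernetes assigns them (30102 in the tests above); they can be read back with, for example:

    $ kubectl get svc nginx-lua-load-balancer \
        -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.nodePort}{"\n"}{end}'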

This is a tricky one because the problem could be at any layer of the stack.

A few suggestions:

  • Check the logs of the kube-proxy pod running on the relevant node:

    $ kubectl logs -n kube-system <kube-proxy-pod>
    

    or ssh into the box and run:

    $ docker logs <kube-proxy-container>
    

    You can also try increasing the verbosity of the kube-proxy logging by editing the kube-proxy DaemonSet:

      containers:
      - command:
        - /bin/sh
        - -c
        - kube-proxy --v=9 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}  # <== --v=9 here
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
        imagePullPolicy: IfNotPresent
        name: kube-proxy
    
  • Does kube-proxy have enough resources on the node it is running on? You can also try editing the kube-proxy DaemonSet to give it more resources (CPU, memory):

      containers:
      - command:
        - /bin/sh
        - -c
        - kube-proxy --v=2 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
        imagePullPolicy: IfNotPresent
        name: kube-proxy
        resources:
          requests:
            cpu: 300m  # <== this instead of the default 100m
    
  • You can try enabling iptables logging on the nodes to check whether packets are being dropped somewhere, as sketched below.
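
A minimal sketch of the iptables logging idea, run on a node and assuming the NodePort 30102 from the question (the LOG rule and log prefix are illustrative, not part of any existing setup):

    $ # inspect the rules kube-proxy has programmed for the NodePort
    $ sudo iptables-save | grep 30102

    $ # log new TCP connections hitting the NodePort before any NAT, then watch the kernel log
    $ sudo iptables -t raw -I PREROUTING -p tcp --dport 30102 --syn -j LOG --log-prefix "NODEPORT-IN: "
    $ sudo dmesg -wT | grep NODEPORT-IN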

In the end, this issue was caused by a pod that was configured incorrectly, causing the load balancer to route traffic to it:

selector:
  matchLabels:
    app: redis-cli

There were 5 nginx pods correctly receiving traffic, plus one utility pod incorrectly receiving traffic and responding by refusing the connection, exactly as you would expect.
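
A rogue pod like this can be spotted by comparing the Service's endpoints with the pods the selector is expected to match; a sketch using the names from the templates above:

    $ kubectl get endpoints nginx-lua-load-balancer
    $ kubectl get pods -l app=nginx-lua-ssl -o wide

Any endpoint IP that does not belong to one of the nginx pods points at the misconfigured workload.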

Thanks for your responses.