Connection refused between kube-proxy and nginx backend
We regularly see connection-refused errors on a custom NGINX reverse proxy installed in AWS EKS (see the Kubernetes templates below).
Initially we assumed this was a load balancer problem, but after further investigation the issue appears to sit between kube-proxy and the nginx Pods.
When I repeatedly run wget IP:PORT against just a node's internal IP and the service's NodePort, we see a number of Bad Request responses and, eventually, failed: Connection refused.
When I run the same requests against just a Pod IP and port, I never get this connection refused.
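For reference, the repeated check can be scripted as a small loop. NODE_IP, NODE_PORT and ATTEMPTS below are placeholders, not values from the cluster; the defaults point at a closed local port purely so the sketch is self-contained:

```shell
# Hypothetical repro loop: hammer one endpoint and count how many
# requests get an HTTP response vs. fail at the TCP/HTTP level.
NODE_IP="${NODE_IP:-127.0.0.1}"
NODE_PORT="${NODE_PORT:-1}"
ATTEMPTS="${ATTEMPTS:-10}"
answered=0
refused=0
i=0
while [ "$i" -lt "$ATTEMPTS" ]; do
  # wget exits non-zero on both "Connection refused" and HTTP 4xx,
  # so the failure count lumps the two together; inspect stderr if
  # you need to tell them apart.
  if wget -q -O /dev/null -T 2 -t 1 "http://${NODE_IP}:${NODE_PORT}/"; then
    answered=$((answered + 1))
  else
    refused=$((refused + 1))
  fi
  i=$((i + 1))
done
echo "answered=${answered} failed=${refused}"
```

Running this once against the NodePort and once against a Pod IP makes the asymmetry described above easy to quantify.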
Example wget output
Failing:
wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31-- http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... failed: Connection refused.
Succeeding:
wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31-- http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... connected.
HTTP request sent, awaiting response... 400 Bad Request
2020-06-26 01:15:31 ERROR 400: Bad Request.
In the logs of the NGINX service we do not see the refused requests at all, while we do see the other Bad Requests.
I have read through several issues about kube-proxy and would be interested in any further insight into improving this situation.
For example: https://github.com/kubernetes/kubernetes/issues/38456
Any help is much appreciated.
Kubernetes templates
##
# Main nginx deployment. Requires updated tag potentially for
# docker image
##
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-lua-ssl-deployment
  labels:
    service: https-custom-domains
spec:
  selector:
    matchLabels:
      app: nginx-lua-ssl
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx-lua-ssl
        service: https-custom-domains
    spec:
      containers:
        - name: nginx-lua-ssl
          image: "0000000000.dkr.ecr.ap-southeast-2.amazonaws.com/lua-resty-auto-ssl:v0.NN"
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
            - containerPort: 8443
            - containerPort: 8999
          envFrom:
            - configMapRef:
                name: https-custom-domain-conf
##
# Load balancer which manages traffic into the nginx instance
# In aws, this uses an ELB (elastic load balancer) construct
##
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  name: nginx-lua-load-balancer
  labels:
    service: https-custom-domains
spec:
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: https
      port: 443
      targetPort: 8443
  externalTrafficPolicy: Local
  selector:
    app: nginx-lua-ssl
  type: LoadBalancer
This is a tricky one because the problem could live at any layer of the stack.
A few suggestions:
Check the kube-proxy logs on the relevant node:
$ kubectl logs <kube-proxy-pod>
or ssh onto the box and run
$ docker logs <kube-proxy-container>
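If you are not sure which kube-proxy Pod runs on the node you are testing, the default EKS label (assumed here to be k8s-app=kube-proxy) lets you list them with their node assignments:

```shell
# Show each kube-proxy pod together with the node it is scheduled on,
# then tail the one on the node whose NodePort is refusing connections.
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs <kube-proxy-pod> --tail=200
```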
You can also try increasing the verbosity of the kube-proxy logging by changing the kube-proxy DaemonSet:
containers:
  - command:
      - /bin/sh
      - -c
      - kube-proxy --v=9 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
    imagePullPolicy: IfNotPresent
    name: kube-proxy
Does kube-proxy have enough resources on the node it is running on? You can also try changing the kube-proxy DaemonSet to give it more resources (CPU, memory):
containers:
  - command:
      - /bin/sh
      - -c
      - kube-proxy --v=2 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
    imagePullPolicy: IfNotPresent
    name: kube-proxy
    resources:
      requests:
        cpu: 300m   # <== this instead of 100m
You can try enabling iptables logging on the node to check whether packets are being dropped for some reason.
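A minimal sketch of what that could look like, assuming root access on the node and the 30102 NodePort from the wget output above (the log prefixes are arbitrary markers, not anything Kubernetes defines):

```shell
# Log inbound SYNs to the NodePort before any NAT happens, and log
# outbound TCP RSTs (a refused connection shows up as an RST).
iptables -t raw -I PREROUTING -p tcp --dport 30102 --syn -j LOG --log-prefix "nodeport-syn: "
iptables -t filter -I OUTPUT -p tcp --tcp-flags RST RST -j LOG --log-prefix "tcp-rst: "

# Watch the kernel log while reproducing the failure:
dmesg -wT | grep -E "nodeport-syn|tcp-rst"

# Clean up afterwards:
iptables -t raw -D PREROUTING -p tcp --dport 30102 --syn -j LOG --log-prefix "nodeport-syn: "
iptables -t filter -D OUTPUT -p tcp --tcp-flags RST RST -j LOG --log-prefix "tcp-rst: "
```

A SYN that is logged but answered with an RST from the same node points at the kube-proxy/iptables layer rather than the load balancer.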
In the end, the problem turned out to be a misconfigured Pod that the load balancer was routing traffic to:
selector:
  matchLabels:
    app: redis-cli
There were 5 nginx pods correctly receiving traffic, and one utility Pod incorrectly receiving traffic and responding, as you would expect, by refusing the connection.
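A label mix-up like this can be spotted by comparing the Service's endpoints with the pods you expect behind it (service and label names taken from the templates above):

```shell
# Pod IPs the Service actually routes to; any IP that does not belong
# to one of the 5 nginx replicas indicates a stray pod matching the selector.
kubectl get endpoints nginx-lua-load-balancer -o wide

# Pods carrying the label the Service selects on, with their IPs:
kubectl get pods -l app=nginx-lua-ssl -o wide
```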
Thanks for your response.