JupyterHub pod no longer connects to Postgres pod
I have a Kubernetes cluster containing a JupyterHub pod and a PostgreSQL pod serving as its database. Everything ran fine for months, until a recent incident in which the shared storage ran full; the resulting filesystem warnings forced the attached Linux machines, including this cluster's nodes, into a read-only state. That issue, and every other problem stemming from it so far, has since been resolved; the nodes and pods all appear to start up fine, but the JupyterHub pod alone goes into CrashLoopBackOff because, for some reason, it can no longer connect to the database service/pod.
Here are the logs I have collected so far from the relevant pods. I have redacted usernames and passwords for obvious reasons, but I have checked that they match between the pods. As mentioned, I did not change the configuration, and the system ran fine before the incident.
kubectl logs <jupyterhub> | tail
[I 2022-01-22 08:04:28.905 JupyterHub app:2349] Running JupyterHub version 1.3.0
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Authenticator: builtins.MyAuthenticator
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Spawner: builtins.MySpawner
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.3.0
[I 2022-01-22 08:04:28.981 JupyterHub app:1465] Writing cookie_secret to /jhub/jupyterhub_cookie_secret
[E 2022-01-22 08:04:39.048 JupyterHub app:1597] Failed to connect to db: postgresql://[redacted]:[redacted]@postgres:1500
[C 2022-01-22 08:04:39.049 JupyterHub app:1601] If you recently upgraded JupyterHub, try running
jupyterhub upgrade-db
to upgrade your JupyterHub database schema
The database itself seems to be running fine.
kubectl logs <postgres> | tail
2022-01-19 13:47:50.245 UTC [1] LOG: starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 1500
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv6 address "::", port 1500
2022-01-19 13:47:50.380 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.1500"
2022-01-19 13:47:50.494 UTC [62] LOG: database system was shut down at 2022-01-19 13:47:49 UTC
2022-01-19 13:47:50.535 UTC [1] LOG: database system is ready to accept connections
The same goes for the service:
kubectl describe service postgres
Name: postgres
Namespace: jhub
Labels: <none>
Annotations: <none>
Selector: app=postgres
Type: ClusterIP
IP Families: <none>
IP: 10.100.209.184
IPs: 10.100.209.184
Port: <unset> 1500/TCP
TargetPort: 1500/TCP
Endpoints: 10.0.0.139:1500
Session Affinity: None
Events: <none>
For reference, here are the relevant YAML files.
postgres.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres
          ports:
            - containerPort: 1500
          env:
            - name: POSTGRES_USER
              value: <redacted>
            - name: POSTGRES_PASSWORD
              value: <redacted>
            - name: PGPORT
              value: '1500'
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - protocol: TCP
      port: 1500
      targetPort: 1500
The database URL in jupyterhub_config.py also shows nothing unusual:
postgres_passwd = os.getenv('POSTGRES_PASSWORD')
c.JupyterHub.db_url = f'postgresql://redacted:{postgres_passwd}@postgres:1500'
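As a quick sanity check that the URL at least parses into the expected host and port, something like the following can be run locally (the credentials here are hypothetical placeholders, not the real redacted values):

```python
from urllib.parse import urlparse

# Placeholder credentials; the real values are redacted.
db_url = "postgresql://hubuser:secret@postgres:1500"

parsed = urlparse(db_url)
print(parsed.hostname)  # "postgres" -- the Service name the Hub must resolve via DNS
print(parsed.port)      # 1500 -- must match the Service port
```

If the hostname and port come out as expected, the URL itself is fine, and the failure has to be in name resolution or connectivity.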
That is everything I think is relevant right now; let me know if you need anything more.
I am somewhat stumped. As I said, the main point is that everything ran fine before the incident, and the overall cluster configuration has not changed. All the other problems had some external cause through which they could be identified and fixed, but this one appears to be entirely contained within the cluster.
Thanks for reading; any help or hints are appreciated.
Edit: list of actions taken so far:
- Previously tried restarting the entire node.
- Launched a postgres client pod in the same namespace and tried connecting with the URI from jupyterhub_config; this gave me: could not translate host name "postgres" to address: Temporary failure in name resolution.
- Working down the DNS Troubleshooting checklist, I found the following problem:
kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 260d
kubectl describe endpoints kube-dns --namespace=kube-system
Name: kube-dns
Namespace: kube-system
Labels: k8s-app=kube-dns
kubernetes.io/cluster-service=true
kubernetes.io/name=KubeDNS
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2022-01-10T13:09:07Z
Subsets:
Addresses: <none>
NotReadyAddresses: 10.0.6.78,10.0.9.76
Ports:
Name Port Protocol
---- ---- --------
dns-tcp 53 TCP
dns 53 UDP
metrics 9153 TCP
Events: <none>
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
Error from server: Get "https://141.83.188.131:10250/containerLogs/kube-system/coredns-74ff55c5b-n76km/coredns?tailLines=10": dial tcp 141.83.188.131:10250: connect: connection refused
kubectl describe --namespace=kube-system -l k8s-app=kube-dns
error: You must specify the type of resource to describe. Use "kubectl api-resources" for a complete list of supported resources.
bergmann@k8s-manager:~/jupyterhub$ kubectl describe pod --namespace=kube-system -l k8s-app=kube-dns
Name: coredns-74ff55c5b-n76km
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-08/141.83.188.131
Start Time: Mon, 06 Dec 2021 01:40:35 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.6.78
IPs:
IP: 10.0.6.78
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://d7239ff0f11295180ebff1434bc8a0dcb357a5d55128e8cf02b2b821822da6b3
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:40:41 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: coredns-74ff55c5b-vv8v7
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-10/141.83.188.161
Start Time: Mon, 06 Dec 2021 01:49:21 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.9.76
IPs:
IP: 10.0.9.76
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://986105a2646ecdadf6fadbd700b9fdbeb578325603ee8353e5283b2b65967c23
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:49:27 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
After tracing the problem to the kube-dns pods, I restarted them. This resolved the issue, though I still do not know why it happened in the first place.
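For anyone hitting the same symptom: the restart amounted to deleting the CoreDNS pods and letting their ReplicaSet recreate them, roughly as follows (the exact label matches the `k8s-app=kube-dns` selector shown above):

```shell
# Delete the CoreDNS pods; their ReplicaSet recreates them immediately.
kubectl delete pod --namespace=kube-system -l k8s-app=kube-dns

# Verify the new pods become Ready and the kube-dns endpoints repopulate
# (previously the addresses were stuck under NotReadyAddresses).
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
kubectl get endpoints kube-dns --namespace=kube-system
```

Once the endpoints list shows addresses again, in-cluster names like `postgres` resolve and the JupyterHub pod can reach its database.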