JupyterHub pod no longer connects to Postgres pod
I have a Kubernetes cluster containing a JupyterHub pod and a PostgreSQL pod serving as its database. Everything ran fine for months, until a recent incident in which the shared storage ran full; the resulting filesystem warnings forced the attached Linux machines, including this cluster's nodes, into a read-only state. That issue, and every other problem stemming from it so far, has since been resolved; the nodes and pods all appear to start up fine, but the JupyterHub pod alone goes into CrashLoopBackOff because, for some reason, it can no longer connect to the database service/pod.
Here are the logs I have collected so far from the relevant pods. I have redacted usernames and passwords for obvious reasons, but I have checked that they match between the pods. As mentioned, I did not change the configuration, and the system ran fine before the incident.
kubectl logs <jupyterhub> | tail
[I 2022-01-22 08:04:28.905 JupyterHub app:2349] Running JupyterHub version 1.3.0
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Authenticator: builtins.MyAuthenticator
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Spawner: builtins.MySpawner
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.3.0
[I 2022-01-22 08:04:28.981 JupyterHub app:1465] Writing cookie_secret to /jhub/jupyterhub_cookie_secret
[E 2022-01-22 08:04:39.048 JupyterHub app:1597] Failed to connect to db: postgresql://[redacted]:[redacted]@postgres:1500
[C 2022-01-22 08:04:39.049 JupyterHub app:1601] If you recently upgraded JupyterHub, try running
jupyterhub upgrade-db
to upgrade your JupyterHub database schema
The database itself seems to be running fine.
kubectl logs <postgres> | tail
2022-01-19 13:47:50.245 UTC [1] LOG: starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 1500
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv6 address "::", port 1500
2022-01-19 13:47:50.380 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.1500"
2022-01-19 13:47:50.494 UTC [62] LOG: database system was shut down at 2022-01-19 13:47:49 UTC
2022-01-19 13:47:50.535 UTC [1] LOG: database system is ready to accept connections
The same goes for the service:
kubectl describe service postgres
Name: postgres
Namespace: jhub
Labels: <none>
Annotations: <none>
Selector: app=postgres
Type: ClusterIP
IP Families: <none>
IP: 10.100.209.184
IPs: 10.100.209.184
Port: <unset> 1500/TCP
TargetPort: 1500/TCP
Endpoints: 10.0.0.139:1500
Session Affinity: None
Events: <none>
For reference, here are the relevant YAML files.
postgres.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres
          ports:
            - containerPort: 1500
          env:
            - name: POSTGRES_USER
              value: <redacted>
            - name: POSTGRES_PASSWORD
              value: <redacted>
            - name: PGPORT
              value: '1500'
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - protocol: TCP
      port: 1500
      targetPort: 1500
The database URL in jupyterhub_config.py also shows nothing unusual:
postgres_passwd = os.getenv('POSTGRES_PASSWORD')
c.JupyterHub.db_url = f'postgresql://redacted:{postgres_passwd}@postgres:1500'
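As a quick sanity check that the URL at least parses into the expected host and port, something like the following can be run locally (the credentials here are hypothetical placeholders, not the real redacted values):

```python
from urllib.parse import urlparse

# Placeholder credentials; the real values are redacted.
db_url = "postgresql://hubuser:secret@postgres:1500"

parsed = urlparse(db_url)
print(parsed.hostname)  # "postgres" -- the Service name the Hub must resolve via DNS
print(parsed.port)      # 1500 -- must match the Service port
```

If the hostname and port come out as expected, the URL itself is fine, and the failure has to be in name resolution or connectivity.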
That is everything I think is relevant right now; let me know if you need anything more.
I am somewhat stumped. As I said, the main point is that everything ran fine before the incident, and the overall cluster configuration has not changed. All the other problems had some external cause through which they could be identified and fixed, but this one appears to be entirely contained within the cluster.
Thanks for reading; any help or hints are appreciated.
Edit: list of actions taken so far:
- Previously tried restarting the entire node.
- Launched a postgres client pod in the same namespace and tried connecting with the URI from jupyterhub_config; this gave me: could not translate host name "postgres" to address: Temporary failure in name resolution.
- Working down the DNS Troubleshooting checklist, I found the following problem:
kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 260d
kubectl describe endpoints kube-dns --namespace=kube-system
Name: kube-dns
Namespace: kube-system
Labels: k8s-app=kube-dns
kubernetes.io/cluster-service=true
kubernetes.io/name=KubeDNS
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2022-01-10T13:09:07Z
Subsets:
Addresses: <none>
NotReadyAddresses: 10.0.6.78,10.0.9.76
Ports:
Name Port Protocol
---- ---- --------
dns-tcp 53 TCP
dns 53 UDP
metrics 9153 TCP
Events: <none>
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
Error from server: Get "https://141.83.188.131:10250/containerLogs/kube-system/coredns-74ff55c5b-n76km/coredns?tailLines=10": dial tcp 141.83.188.131:10250: connect: connection refused
kubectl describe --namespace=kube-system -l k8s-app=kube-dns
error: You must specify the type of resource to describe. Use "kubectl api-resources" for a complete list of supported resources.
bergmann@k8s-manager:~/jupyterhub$ kubectl describe pod --namespace=kube-system -l k8s-app=kube-dns
Name: coredns-74ff55c5b-n76km
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-08/141.83.188.131
Start Time: Mon, 06 Dec 2021 01:40:35 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.6.78
IPs:
IP: 10.0.6.78
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://d7239ff0f11295180ebff1434bc8a0dcb357a5d55128e8cf02b2b821822da6b3
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:40:41 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: coredns-74ff55c5b-vv8v7
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-10/141.83.188.161
Start Time: Mon, 06 Dec 2021 01:49:21 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.9.76
IPs:
IP: 10.0.9.76
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://986105a2646ecdadf6fadbd700b9fdbeb578325603ee8353e5283b2b65967c23
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:49:27 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
After tracing the problem to the kube-dns pods, I restarted them. This resolved the issue, though I still do not know why it happened in the first place.
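For anyone hitting the same symptom: the restart amounted to deleting the CoreDNS pods and letting their ReplicaSet recreate them, roughly as follows (the exact label matches the `k8s-app=kube-dns` selector shown above):

```shell
# Delete the CoreDNS pods; their ReplicaSet recreates them immediately.
kubectl delete pod --namespace=kube-system -l k8s-app=kube-dns

# Verify the new pods become Ready and the kube-dns endpoints repopulate
# (previously the addresses were stuck under NotReadyAddresses).
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
kubectl get endpoints kube-dns --namespace=kube-system
```

Once the endpoints list shows addresses again, in-cluster names like `postgres` resolve and the JupyterHub pod can reach its database.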