Lost my OpenShift console ("Application is not available")

The console UI in my OpenShift 4.5.x installation mysteriously stopped working. Visiting the console URL now results in the message:

Application is not available

The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

You usually see this when a route exists but cannot find a corresponding service or pod, but in this case the route exists:

$ oc -n openshift-console get route
NAME        HOST/PORT                                             PATH   SERVICES    PORT    TERMINATION          WILDCARD
console     console-openshift-console.apps.example.com            console     https   reencrypt/Redirect   None
downloads   downloads-openshift-console.apps.example.com          downloads   http    edge/Redirect        None

The service exists:

$ oc -n openshift-console get service
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
console     ClusterIP   172.30.36.70     <none>        443/TCP   57d
downloads   ClusterIP   172.30.190.186   <none>        80/TCP    57d

And the pods exist and are running, although the console pods report 0/1 ready and have each restarted once:

$ oc -n openshift-console get pods
NAME                       READY   STATUS    RESTARTS   AGE
console-76c8d7d755-gtfm8   0/1     Running   1          4m12s
console-76c8d7d755-mvf6n   0/1     Running   1          4m12s
downloads-9656c996-mmqhk   1/1     Running   0          53d
downloads-9656c996-z2khj   1/1     Running   0          53d
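
A quick way to spot pods that are running but not ready is to compare the two halves of the READY column. The following is only a sketch: `not_ready_pods` is a hypothetical helper name, and it assumes the default tabular output of `oc get pods` with NAME in the first column and READY in the second.

```shell
# Print the names of pods whose READY column is not fully satisfied (e.g. 0/1).
# Reads the default tabular output of `oc get pods` on standard input.
# not_ready_pods is a hypothetical helper, not part of the oc client.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage against a live cluster (requires a logged-in oc client):
#   oc -n openshift-console get pods | not_ready_pods
```

Applied to the listing above, this would print the two console pods and skip the downloads pods.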

Looking at the logs for the console pods, there appears to be a problem contacting the OAuth service:

2021-01-04T22:05:48Z auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-01-04T22:05:58Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.example.com/oauth/token failed: Head https://oauth-openshift.apps.example.com: EOF
2021-01-04T22:06:13Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.example.com/oauth/token failed: Head https://oauth-openshift.apps.example.com: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-01-04T22:06:23Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.example.com/oauth/token failed: Head https://oauth-openshift.apps.example.com: EOF
2021-01-04T22:06:38Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.example.com/oauth/token failed: Head https://oauth-openshift.apps.example.com: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021-01-04T22:06:53Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.example.com/oauth/token failed: Head https://oauth-openshift.apps.example.com: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
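
To see at a glance which endpoints the console fails to reach, the log lines can be reduced to their distinct URLs. This is a rough sketch that assumes the log format shown above; `failing_endpoints` is a hypothetical helper name.

```shell
# List the distinct URLs mentioned in "error contacting auth provider" lines.
# Reads console pod logs on standard input.
# failing_endpoints is a hypothetical helper, not part of the oc client.
failing_endpoints() {
  grep 'error contacting auth provider' |
    grep -oE 'https://[^ ]+' |   # pull out every URL on the matching lines
    sed 's/:$//' |               # drop the colon that follows a URL in the message
    sort -u
}

# Usage against a live cluster (pod name taken from the listing above):
#   oc -n openshift-console logs console-76c8d7d755-gtfm8 | failing_endpoints
```

For the log excerpt above, the result includes both the internal API server address and the external oauth-openshift route, which suggests the failure is not limited to one service.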

But the pods in the openshift-authentication namespace appear healthy and report no errors in their logs. Where should I look for the root of the problem?


The expected routes and services exist in the openshift-authentication namespace:

$ oc -n openshift-authentication get route
NAME              HOST/PORT                                 PATH   SERVICES          PORT   TERMINATION            WILDCARD
oauth-openshift   oauth-openshift.apps.example.com          oauth-openshift   6443   passthrough/Redirect   None

$ oc -n openshift-authentication get service
NAME              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
oauth-openshift   ClusterIP   172.30.233.202   <none>        443/TCP   57d

$ oc -n openshift-authentication get route oauth-openshift -o json | jq .status
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2020-11-08T19:48:08Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.example.com",
      "routerCanonicalHostname": "apps.example.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}

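It turned out to be a problem with the default ingress router. There were no obvious errors, but I was able to resolve the issue by restarting the router: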

oc -n openshift-ingress get pod -o json |
  jq -r '.items[].metadata.name' |
  xargs oc -n openshift-ingress delete pod
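
The same restart can probably be expressed without jq by using the `--all` flag of `oc delete pod`. The sketch below wraps it in a hypothetical `restart_router_pods` function; the `OC` variable is also an assumption, added only so the command can be pointed at a stub for a dry run.

```shell
# Client binary to call; override OC with a stub to preview the command.
# Both OC and restart_router_pods are hypothetical names for this sketch.
OC="${OC:-oc}"

# Delete every pod in the openshift-ingress namespace; the router
# deployment then recreates them, which restarts the router.
restart_router_pods() {
  "$OC" -n openshift-ingress delete pod --all
}
```

Like the pipeline above, this deletes all router pods at once, so ingress traffic may be briefly interrupted while the replacements start.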

I ran into the same problem on OpenShift 3.11.

I simply deleted the secret holding the certificate; OpenShift created a new secret, and now the console works.

oc delete secret console-serving-cert -n openshift-console