My AKS cluster was brought down, how can I recover?

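I have been load testing my application on an AKS cluster with a single agent node. During the test, the connection to the dashboard dropped and never came back. My application also appears to be down, so I assume the cluster is in a bad state.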

The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io

kubectl cluster-info dump shows the following error:

{
    "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
    "resourceVersion": "185572",
    "creationTimestamp": "2017-11-30T02:36:34Z",
    "InvolvedObject": {
        "Kind": "Pod",
        "Namespace": "kube-system",
        "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
        "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
        "APIVersion": "v1",
        "ResourceVersion": "299",
        "FieldPath": "spec.containers{kubedns}"
    },
    "Reason": "Unhealthy",
    "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
    "Source": {
        "Component": "kubelet",
        "Host": "aks-agentpool-34912234-0"
    },
    "FirstTimestamp": "2017-11-30T02:23:50Z",
    "LastTimestamp": "2017-11-30T02:59:00Z",
    "Count": 6,
    "Type": "Warning"
}

There are also several pod sync errors in kube-system.

Example of the problem:

az aks browse -g REstate.Server -n REstate

Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq

Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out

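You will probably need to SSH into the node to check whether the kubelet service is running. Going forward, you can set resource quotas so that workloads cannot exhaust all the resources on the cluster's nodes.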
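Once you are on the node, a check along these lines may help. This is only a sketch: it assumes the kubelet runs as a systemd unit named `kubelet`, and the helper name `kubelet_state` is made up for illustration. It only reports the state and prints suggested follow-up commands rather than restarting anything itself.

```shell
# Run on the agent node after connecting over SSH.
# Assumption: the kubelet is managed by systemd under the unit name "kubelet".

# Hypothetical helper: prints the unit state as reported by systemctl
# (for example "active", "inactive" or "failed"); prints nothing if
# systemctl is not available on this machine.
kubelet_state() {
  systemctl is-active kubelet 2>/dev/null || true
}

state="$(kubelet_state)"
echo "kubelet state: ${state:-unknown}"

if [ "$state" != "active" ]; then
  # These need root, so they are printed as suggestions instead of being run.
  echo "To inspect recent logs: sudo journalctl -u kubelet --no-pager -n 50"
  echo "To restart the service: sudo systemctl restart kubelet"
fi
```

If the state is anything other than `active`, the kubelet logs are usually the first place to look before restarting the service.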

Resource quotas: https://kubernetes.io/docs/concepts/policy/resource-quotas/
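As a rough illustration of the quota suggestion, the snippet below writes a ResourceQuota manifest to a file. The namespace `loadtest`, the quota name `loadtest-quota`, and all of the limits are placeholders chosen for the example, not values taken from your cluster; size them for your own node.

```shell
# Write a hypothetical ResourceQuota manifest for the namespace that hosts
# the application under load test. All names and numbers are assumptions.
cat > loadtest-quota.yaml <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: loadtest-quota
  namespace: loadtest
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    pods: "10"
EOF

echo "wrote loadtest-quota.yaml"
```

Once the API server is reachable again, this could be applied with `kubectl create namespace loadtest` followed by `kubectl apply -f loadtest-quota.yaml`. Note that a quota applies per namespace, and once it covers CPU or memory, pods in that namespace must declare matching requests and limits or they will be rejected.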