Kubernetes (K3S) POD gets "ENOTFOUND" after 5-20 hours of uptime

I am running my Kubernetes backend with 250 pods across 15 deployments; the backend is written in NodeJS.

Sometimes, after X hours (anywhere from 5 to 20), the backend starts getting ENOTFOUND errors like the one below:

{
  "name": "main",
  "hostname": "entrypoint-sdk-54c8788caa-aa3cj",
  "pid": 19,
  "level": 50,
  "error": {
    "errno": -3008,
    "code": "ENOTFOUND",
    "syscall": "getaddrinfo",
    "hostname": "employees-service"
  },
  "msg": "Failed calling getEmployee",
  "time": "2022-01-28T13:44:36.549Z",
  "v": 0
}

I am stress-testing the backend at YY users per second, but I keep that load level steady and never change it; then it suddenly happens, for no specific reason.

Kubernetes is K3S, server version: v1.21.5+k3s2

Any idea what can cause this weird ENOTFOUND?

I have already seen your same question on GitHub and the reference to getaddrinfo ENOTFOUND with the newest versions.

According to the comments, this issue was not present on k3s 1.21, i.e. one version below yours. I know it is almost impossible, but is there any chance to try the same setup on that version?

The error itself seems to come from node/lib/dns.js:

function errnoException(err, syscall, hostname) {
  // FIXME(bnoordhuis) Remove this backwards compatibility nonsense and pass
  // the true error to the user. ENOTFOUND is not even a proper POSIX error!
  if (err === uv.UV_EAI_MEMORY ||
      err === uv.UV_EAI_NODATA ||
      err === uv.UV_EAI_NONAME) {
    err = 'ENOTFOUND';
  }
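
For context, here is a minimal sketch (mine, not from the question) of how that mapping surfaces in application code: resolving a name the resolver cannot answer produces the same { code: 'ENOTFOUND', syscall: 'getaddrinfo' } shape seen in the log above.

// Minimal sketch: any getaddrinfo failure (EAI_NONAME, EAI_NODATA, EAI_MEMORY)
// is reported to the application as ENOTFOUND, matching the log entry above.
const dns = require('dns');

dns.lookup('employees-service', (err, address, family) => {
  if (err) {
    // err.code === 'ENOTFOUND', err.syscall === 'getaddrinfo',
    // err.hostname === 'employees-service'
    console.error(err.code, err.syscall, err.hostname);
    return;
  }
  console.log(address, family);
});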

I would suggest you check Solving DNS lookup failures in Kubernetes. The article describes the long, hard way of catching the same error you hit from time to time.
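
In the same spirit (purely my own illustration, not code from that article), one cheap way to catch these failures inside the NodeJS backend is to wrap dns.lookup and log every resolution attempt. The hostname is taken from your log; the path and helper name are hypothetical:

// Sketch: time and log every getaddrinfo call a request makes, so sporadic
// ENOTFOUND errors show up together with how long the lookup took.
const http = require('http');
const dns = require('dns');

function loggedLookup(hostname, options, callback) {
  const started = Date.now();
  dns.lookup(hostname, options, (err, address, family) => {
    const ms = Date.now() - started;
    if (err) console.error(`DNS lookup for ${hostname} failed after ${ms}ms: ${err.code}`);
    callback(err, address, family);
  });
}

// http.get/http.request accept a custom `lookup` function and pass it down to net.connect.
http.get({ host: 'employees-service', path: '/employee', lookup: loggedLookup }, (res) => {
  res.resume();
}).on('error', (err) => console.error('request failed:', err.code));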

As a solution, after investigating all the metrics, logs, etc., the author installed the K8s cluster add-on called Node Local DNS cache, which

improves Cluster DNS performance by running a dns caching agent on cluster nodes as a DaemonSet. In today's architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns service for cache misses of cluster hostnames(cluster.local suffix by default).

Motivation

  • With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
  • Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up conntrack table.
  • Connections from local caching agent to kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close, in contrast with UDP entries that have to timeout (default nf_conntrack_udp_timeout is 30 seconds)
  • Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts, usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
  • Metrics & visibility into dns requests at a node level.
  • Negative caching can be re-enabled, thereby reducing number of queries to kube-dns service.
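
Once the add-on is running, a quick way to check from inside a Pod that the node-local caching agent described above actually answers is to query it directly. This is only a sketch of mine: it assumes the add-on was installed with its default link-local listen address 169.254.20.10 and that employees-service lives in the default namespace.

// Sketch: point a resolver at the node-local DNS cache and resolve a cluster name.
// 169.254.20.10 is the add-on's default listen address; adjust if you changed it.
const { Resolver } = require('dns').promises;

const resolver = new Resolver();
resolver.setServers(['169.254.20.10']);

resolver.resolve4('employees-service.default.svc.cluster.local')
  .then((addresses) => console.log('resolved via node-local cache:', addresses))
  .catch((err) => console.error('lookup failed:', err.code));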