transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout with AWS-EKS + Terraform + Istio

I set up a (what I believe to be) bog-standard EKS cluster using terraform-aws-eks, like so:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}

Then I tried to install Istio per the install docs:

istioctl install

which resulted in:

✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
  Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

So I did some digging:

kubectl logs istio-ingressgateway-7fd568fc99-6ql8h -n istio-system

which yielded:

2022-04-17T13:51:14.540346Z warn    ca  ca request failed, starting attempt 1 in 90.275446ms
2022-04-17T13:51:14.631695Z warn    ca  ca request failed, starting attempt 2 in 195.118437ms
2022-04-17T13:51:14.827286Z warn    ca  ca request failed, starting attempt 3 in 394.627125ms
2022-04-17T13:51:15.222738Z warn    ca  ca request failed, starting attempt 4 in 816.437569ms
2022-04-17T13:51:16.039427Z warn    sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:51:33.941084Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 318s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:05.830859Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 350s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:26.232441Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 370s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"

So from a lot of reading, it seems the istio-ingressgateway pod can't connect to istiod?

Google time, and I found this: https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/#verifying-connectivity-to-istiod

kubectl create namespace foo
kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo

kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS istiod.istio-system:15014/version

Which gave me:

curl: (7) Failed to connect to istiod.istio-system port 15014 after 4 ms: Connection refused
command terminated with exit code 7

So I'm thinking this problem isn't specific to istio-ingressgateway, but is a more general networking issue in a stock EKS cluster?
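One way to narrow this down (my own suggestion, not from the Istio docs): before blaming the network path, check that the istiod Service actually has endpoints and that its name resolves from the test pod. The deploy/sleep name assumes the sleep sample from above is installed in foo.

```shell
# If ENDPOINTS is empty here, the problem is istiod itself
# (pods not ready), not security groups or routing.
kubectl get svc,endpoints istiod -n istio-system

# Rule out cluster DNS from inside the test pod (nslookup comes
# from the busybox base of the sleep image - an assumption; if it's
# missing, any curl to the FQDN tests resolution just as well).
kubectl exec deploy/sleep -n foo -- \
  nslookup istiod.istio-system.svc.cluster.local
```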

  1. How can I debug this from here to figure out what's wrong? Are there good resources for learning the Kubernetes and Istio networking models?
  2. Why did the Istio platform setup docs drop EKS? Does the Istio team not want Istio to run on AWS-EKS?
  3. Does this look like an issue I should raise against EKS? The aws-eks Terraform module? Istio? I'm not sure exactly where it belongs, and it seems like asking one team for help will almost certainly require help from another.
  4. Are there known incompatibilities between Istio and EKS that I should be aware of?

Thanks in advance!

[22-04-18] Update 1:

OK, so the test with the sleep pod in the foo namespace leads me to believe the connection timeouts are related to AWS security group rules. In theory, an unopened security group port would produce exactly the kind of "connection refused" / "i/o timeout" messages I'm seeing. To test this theory, I took the 4 security groups created by this module:

  1. k8s/EKS/Amazon SG
  2. EKS ENI SG
  3. EKS Cluster SG
  4. EKS Shared node group SG

and opened up all traffic inbound/outbound.

istioctl install
This will install the Istio 1.13.2 default profile with ["Istio core" "Istiod" "Ingress gateways"] components into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

Voila! OK, so now I think I need to work backwards and isolate *which* ports, which security group to apply them to, and whether they belong on the inbound or outbound side. Once I have that, I can PR it back to terraform-aws-eks and save everyone else hours of headache.
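One way to do that isolation (a sketch, assuming the sleep pod from the earlier test is still running in foo) is to probe istiod's candidate ports one at a time from inside the cluster and read curl's exit code:

```shell
# curl exit codes distinguish the failure modes:
#   7  = connection refused (reached a host, nothing listening on the port)
#   28 = timeout (typical of a security group silently dropping packets)
#   other non-zero codes usually mean the TCP connection itself succeeded
#   but the protocol handshake did not - fine for a reachability test.
for port in 15010 15012 15014 15017; do
  kubectl exec deploy/sleep -n foo -- \
    curl -sS --connect-timeout 2 -o /dev/null "istiod.istio-system:${port}"
  echo "port ${port}: curl exit code $?"
done
```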

[22-04-22] Update 2:

Ultimately, I solved this - but then ran into a very common follow-on problem, one I've seen plenty of other people hit and get answers to, just not in a form usable with the terraform-aws-eks module.

After I got istioctl install working:

istioctl install --set profile=demo
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

kubectl label namespace default istio-injection=enabled

kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml

I saw all the bookinfo pods/deployments fail to start with:

Internal error occurred: failed calling 
webhook "namespace.sidecar-injector.istio.io": failed to 
call webhook: Post "https://istiod.istio-system.svc:443
/inject?timeout=10s": context deadline exceeded

The answer to this was similar to the original problem: wrong firewall ports / security group rules. For clarity, I've added a separate answer below. It contains a complete working solution for AWS-EKS + Terraform + Istio.

This is also a common error in bare-metal clusters. In most cases it is caused by RAM memory limits. To isolate the problem, try a smaller profile than demo:

istioctl install --set profile=minimal -y

BLUF: Installing Istio on terraform-aws-eks requires you to add security group rules to allow communication within the node group. You need to:

  1. Add security group rules (ingress/egress) to the shared node security group that open the Istio ports, so Istio installs correctly
  2. Add an ingress security group rule on the node security group, from the control-plane (EKS) security group on port 15017, to solve the failed calling webhook "namespace.sidecar-injector.istio.io" error.

Unfortunately, I still don't know exactly why this works, because I don't yet understand the sequence of operations that happens when an istio-injected pod comes up in a Kubernetes cluster, and who is trying to talk to whom.
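One piece of that ordering is visible in the cluster itself: sidecar injection is a mutating admission webhook, so on every pod create it is the EKS-managed API server - not another pod - that dials istiod's webhook port 15017 (exposed as 443 on the Service). That is why rule 2 has to allow ingress from the control-plane SG specifically. The webhook registration shows the target:

```shell
# Prints the Service the API server calls during pod admission;
# for Istio this is istiod in istio-system, port 443 -> container 15017.
kubectl get mutatingwebhookconfiguration istio-sidecar-injector \
  -o jsonpath='{.webhooks[*].clientConfig.service}'
```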

Research resources

  1. A diagram of the security group architecture for an EKS cluster created by terraform-aws-eks
  2. The ports Istio needs open
  3. A youtube video explaining CNI
  4. The ports Kubernetes uses

Working example

See the comments for which rule set solves which of the two problems from the original answer.

# Ports needed to correctly install Istio, for the error message:
# transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout
locals {
  istio_ports = [
    {
      description = "Envoy admin port / outbound"
      from_port   = 15000
      to_port     = 15001
    },
    {
      description = "Debug port"
      from_port   = 15004
      to_port     = 15004
    },
    {
      description = "Envoy inbound"
      from_port   = 15006
      to_port     = 15006
    },
    {
      description = "HBONE mTLS tunnel port / secure networks XDS and CA services (Plaintext)"
      from_port   = 15008
      to_port     = 15010
    },
    {
      description = "XDS and CA services (TLS and mTLS)"
      from_port   = 15012
      to_port     = 15012
    },
    {
      description = "Control plane monitoring"
      from_port   = 15014
      to_port     = 15014
    },
    {
      description = "Webhook container port, forwarded from 443"
      from_port   = 15017
      to_port     = 15017
    },
    {
      description = "Merged Prometheus telemetry from Istio agent, Envoy, and application, Health checks"
      from_port   = 15020
      to_port     = 15021
    },
    {
      description = "DNS port"
      from_port   = 15053
      to_port     = 15053
    },
    {
      description = "Envoy Prometheus telemetry"
      from_port   = 15090
      to_port     = 15090
    },
    {
      description = "aws-load-balancer-controller"
      from_port   = 9443
      to_port     = 9443
    }
  ]

  ingress_rules = {
    for ikey, ivalue in local.istio_ports :
    "${ikey}_ingress" => {
      description = ivalue.description
      protocol    = "tcp"
      from_port   = ivalue.from_port
      to_port     = ivalue.to_port
      type        = "ingress"
      self        = true
    }
  }

  egress_rules = {
    for ekey, evalue in local.istio_ports :
    "${ekey}_egress" => {
      description = evalue.description
      protocol    = "tcp"
      from_port   = evalue.from_port
      to_port     = evalue.to_port
      type        = "egress"
      self        = true
    }
  }
}

# The AWS-EKS Module definition
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  # IMPORTANT
  node_security_group_additional_rules = merge(
    local.ingress_rules,
    local.egress_rules
  )

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}

# Port needed to solve the error
# Internal error occurred: failed calling 
# webhook "namespace.sidecar-injector.istio.io": failed to 
# call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": # context deadline exceeded
resource "aws_security_group_rule" "allow_sidecar_injection" {
  description = "Webhook container port, From Control Plane"
  protocol    = "tcp"
  type        = "ingress"
  from_port   = 15017
  to_port     = 15017

  security_group_id        = module.eks.node_security_group_id
  source_security_group_id = module.eks.cluster_primary_security_group_id
}
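After terraform apply, the connectivity check from the original question can confirm the rules took effect (this assumes the sleep pod in foo from the question still exists):

```shell
# Should now return the istiod version string instead of
# "curl: (7) Failed to connect ... Connection refused".
kubectl exec deploy/sleep -n foo -- \
  curl -sS istiod.istio-system:15014/version
```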

Please forgive my potentially poor use of Terraform syntax. Happy Kuberneting!

Great question from @mitchellmc, and an even better answer!

As they said, terraform-aws-eks by default does not allow network communication between the nodes. To allow it, and to avoid problems like these, you can include this in your module inputs:

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }

With Istio in place the AWS SGs add little on top anyway - but you should know what you're doing.

Happy Istioing :)