Trying to create NSQ PetSet, pods keep terminating shortly after container launches

Here's the full yaml file (not embedded in the question because it's quite long, and most of the important parts show up in the describe output below):

https://gist.github.com/sporkmonger/46a820f9a1ed8a73d89a319dffb24608

It uses a public container image I created: sporkmonger/nsq-k8s:0.3.8

The container is identical to the official NSQ image, but uses Debian Jessie instead of Alpine/musl to work around the DNS resolution issues that tend to plague Alpine on Kubernetes.
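For reference, the base-image swap amounts to something like the following Dockerfile sketch. This is hypothetical, not the actual build file for sporkmonger/nsq-k8s; the download URL follows the official NSQ release pattern for the 0.3.8 / go1.6.2 build that the logs below report:

```dockerfile
# Hypothetical sketch: base on Debian Jessie instead of Alpine so that
# DNS resolution behaves as expected with kube-dns (musl's resolver
# handles search domains differently than glibc).
FROM debian:jessie

# Fetch an official nsq release tarball and place the binaries on PATH.
# Version matches the image tag used in the question.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && curl -fsSL https://s3.amazonaws.com/bitly-downloads/nsq/nsq-0.3.8.linux-amd64.go1.6.2.tar.gz \
    | tar -xz --strip-components=2 -C /usr/local/bin \
 && apt-get purge -y curl \
 && rm -rf /var/lib/apt/lists/*
```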

Here's what I get when I describe one of the pods:

❯ kubectl describe pod nsqd-0
Name:               nsqd-0
Namespace:          default
Node:               minikube/192.168.99.100
Start Time:         Sun, 04 Dec 2016 20:58:06 -0800
Labels:             app=nsq
Status:             Terminating (expires Sun, 04 Dec 2016 21:02:31 -0800)
Termination Grace Period:   60s
IP:             172.17.0.8
Controllers:            PetSet/nsqd
Containers:
  nsqd:
    Container ID:   docker://381e4a1313e4e13a63b8a17004d79a6e828a8bc1c9e20419b319f8a9757f266b
    Image:      sporkmonger/nsq-k8s:0.3.8
    Image ID:       docker://sha256:01691a91cee3e1a6992b33a10e99baa57c5b8ce7b765849540a830f0b554e707
    Ports:      4150/TCP, 4151/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/local/bin/nsqd
      -data-path
      /data
      -broadcast-address
      $(hostname -f)
      -lookupd-tcp-address
      nsqlookupd-0.nsqlookupd.default.svc.cluster.local:4160
      -lookupd-tcp-address
      nsqlookupd-1.nsqlookupd.default.svc.cluster.local:4160
      -lookupd-tcp-address
      nsqlookupd-2.nsqlookupd.default.svc.cluster.local:4160
    State:      Running
      Started:      Sun, 04 Dec 2016 20:58:11 -0800
    Ready:      True
    Restart Count:  0
    Liveness:       http-get http://:http/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/ping delay=1s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /data from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-k6ufj (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True 
  Ready     True 
  PodScheduled  True 
Volumes:
  datadir:
    Type:   PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-nsqd-0
    ReadOnly:   false
  default-token-k6ufj:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-k6ufj
QoS Class:  BestEffort
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath       Type        Reason      Message
  --------- --------    -----   ----            -------------       --------    ------      -------
  4m        4m      1   {default-scheduler }                Normal      Scheduled   Successfully assigned nsqd-0 to minikube
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Pulling     pulling image "sporkmonger/nsq-k8s:0.3.8"
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Pulled      Successfully pulled image "sporkmonger/nsq-k8s:0.3.8"
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Created     Created container with docker id 381e4a1313e4; Security:[seccomp=unconfined]
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Started     Started container with docker id 381e4a1313e4
  0s        0s      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Killing     Killing container with docker id 381e4a1313e4: Need to kill pod.

Here's a fairly representative 30 seconds or so of watch activity on the cluster:

❯ kubectl get pods -w
NAME           READY     STATUS        RESTARTS   AGE
nsqadmin-0     1/1       Running       3          33m
nsqadmin-1     1/1       Running       0          32m
nsqd-0         1/1       Running       0          6m
nsqd-1         1/1       Running       0          4m
nsqd-2         1/1       Terminating   0          1m
nsqd-3         1/1       Running       0          30s
nsqlookupd-0   1/1       Running       0          30s
NAME           READY     STATUS    RESTARTS   AGE
nsqlookupd-1   0/1       Pending   0          0s
nsqlookupd-1   0/1       Pending   0         0s
nsqlookupd-1   0/1       ContainerCreating   0         0s
nsqlookupd-1   0/1       Running   0         4s
nsqlookupd-1   1/1       Running   0         8s
nsqlookupd-2   0/1       Pending   0         0s
nsqlookupd-2   0/1       Pending   0         0s
nsqlookupd-2   0/1       ContainerCreating   0         0s
nsqlookupd-2   0/1       Terminating   0         0s
nsqd-2    0/1       Terminating   0         2m
nsqd-2    0/1       Terminating   0         2m
nsqd-2    0/1       Terminating   0         2m
nsqlookupd-2   0/1       Terminating   0         4s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-1   1/1       Terminating   0         29s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-0   1/1       Terminating   0         1m
nsqd-2    0/1       Pending   0         0s
nsqd-2    0/1       Pending   0         0s
nsqd-2    0/1       ContainerCreating   0         0s
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       ContainerCreating   0         0s
nsqd-2    0/1       Running   0         4s
nsqlookupd-0   0/1       Running   0         4s
nsqd-2    1/1       Running   0         6s
nsqlookupd-0   1/1       Running   0         10s
nsqlookupd-0   1/1       Terminating   0         10s
nsqlookupd-0   0/1       Terminating   0         11s
nsqlookupd-0   0/1       Terminating   0         11s
nsqlookupd-0   0/1       Terminating   0         11s
nsqd-2    1/1       Terminating   0         12s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       ContainerCreating   0         0s
nsqlookupd-0   0/1       Running   0         3s
nsqlookupd-0   1/1       Running   0         10s

Typical container logs:

❯ kubectl logs nsqd-0
[nsqd] 2016/12/05 05:21:34.666963 nsqd v0.3.8 (built w/go1.6.2)
[nsqd] 2016/12/05 05:21:34.667170 ID: 794
[nsqd] 2016/12/05 05:21:34.667200 NSQ: persisting topic/channel metadata to nsqd.794.dat
[nsqd] 2016/12/05 05:21:34.669232 TCP: listening on [::]:4150
[nsqd] 2016/12/05 05:21:34.669284 HTTP: listening on [::]:4151
[nsqd] 2016/12/05 05:21:35.896901 200 GET /ping (172.17.0.1:51322) 1.511µs
[nsqd] 2016/12/05 05:21:40.290550 200 GET /ping (172.17.0.1:51392) 2.167µs
[nsqd] 2016/12/05 05:21:40.304599 200 GET /ping (172.17.0.1:51394) 1.856µs
[nsqd] 2016/12/05 05:21:50.289018 200 GET /ping (172.17.0.1:51452) 1.865µs
[nsqd] 2016/12/05 05:21:50.299567 200 GET /ping (172.17.0.1:51454) 1.951µs
[nsqd] 2016/12/05 05:22:00.296685 200 GET /ping (172.17.0.1:51548) 2.029µs
[nsqd] 2016/12/05 05:22:00.300842 200 GET /ping (172.17.0.1:51550) 1.464µs
[nsqd] 2016/12/05 05:22:10.295596 200 GET /ping (172.17.0.1:51698) 2.206µs

I'm completely stumped as to why Kubernetes keeps killing these pods. The containers themselves don't appear to be misbehaving, and Kubernetes itself seems to be doing the terminating here...
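For anyone debugging something similar, one quick way to spot the problem that turned out to be the cause here is to compare each service's selector against the labels on the pods (standard kubectl flags, run against your own cluster):

```shell
# Print each service's name alongside its label selector.
kubectl get services -o custom-columns=NAME:.metadata.name,SELECTOR:.spec.selector

# Print every pod with its labels, to compare against the selectors above.
# If multiple services show identical selectors matching all pods, that's
# the smoking gun.
kubectl get pods --show-labels
```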

Figured it out.

My services all had identical selectors. Each service matched every pod, which caused Kubernetes to think there were too many of each running at once, so it randomly killed off the "extras".
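The fix is to give each service a selector that matches only its own pods. A sketch of what that looks like (the `component` label and port layout here are illustrative, not copied from the actual gist):

```yaml
# Headless service for the nsqd pets only. The "component" label is
# hypothetical; the point is that each service's selector must be
# unique to the pods it fronts.
apiVersion: v1
kind: Service
metadata:
  name: nsqd
spec:
  clusterIP: None
  selector:
    app: nsq
    component: nsqd       # distinguishes nsqd pods from nsqlookupd/nsqadmin
  ports:
  - name: tcp
    port: 4150
  - name: http
    port: 4151
---
# Separate service with a separate selector for the nsqlookupd pets.
apiVersion: v1
kind: Service
metadata:
  name: nsqlookupd
spec:
  clusterIP: None
  selector:
    app: nsq
    component: nsqlookupd
  ports:
  - name: tcp
    port: 4160
  - name: http
    port: 4161
```

Each PetSet's pod template then carries the matching `component` label, so no two services ever claim the same pod.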