CockroachDB Cluster on Kubernetes Pods Crashing
I'm trying to install the CockroachDB Helm chart on a 2-node Kubernetes cluster using the following command:
helm install my-release --set statefulset.replicas=2 stable/cockroachdb
I've created 2 persistent volumes:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv00001 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-0 11m
pv00002 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-1 11m
I'm running into a strange error, and since I'm new to Kubernetes I'm not sure what I'm doing wrong. I've tried creating a StorageClass and using it with my PVs, but the CockroachDB PVCs won't bind to them. Could something be wrong with my PV setup?
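For reference, here is roughly how I created the PVs (a sketch matching the listing above; the hostPath path is a placeholder for my actual backing store, and the commented storageClassName line is the part I experimented with):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv00001
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  # storageClassName: my-storage-class   # tried adding this, but then the chart's PVCs wouldn't bind
  hostPath:
    path: /mnt/data/pv00001              # placeholder backing store
EOF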
I've tried kubectl logs, but the only error I see is:
standard_init_linux.go:211: exec user process caused "exec format error"
and the pods keep crashing over and over:
NAME READY STATUS RESTARTS AGE
my-release-cockroachdb-0 0/1 Pending 0 11m
my-release-cockroachdb-1 0/1 CrashLoopBackOff 7 11m
my-release-cockroachdb-init-tfcks 0/1 CrashLoopBackOff 5 5m29s
Any idea why the pods are crashing?
Here is the kubectl describe output for the init pod:
Name: my-release-cockroachdb-init-tfcks
Namespace: default
Priority: 0
Node: axon/192.168.1.7
Start Time: Sat, 04 Apr 2020 00:22:19 +0100
Labels: app.kubernetes.io/component=init
app.kubernetes.io/instance=my-release
app.kubernetes.io/name=cockroachdb
controller-uid=54c7c15d-eb1c-4392-930a-d9b8e9225a45
job-name=my-release-cockroachdb-init
Annotations: <none>
Status: Running
IP: 10.44.0.1
IPs:
IP: 10.44.0.1
Controlled By: Job/my-release-cockroachdb-init
Containers:
cluster-init:
Container ID: docker://82a062c6862a9fd5047236feafe6e2654ec1f6e3064fd0513341a1e7f36eaed3
Image: cockroachdb/cockroach:v19.2.4
Image ID: docker-pullable://cockroachdb/cockroach@sha256:511b6d09d5bc42c7566477811a4e774d85d5689f8ba7a87a114b96d115b6149b
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
while true; do initOUT=$(set -x; /cockroach/cockroach init --insecure --host=my-release-cockroachdb-0.my-release-cockroachdb:26257 2>&1); initRC="$?"; echo $initOUT; [[ "$initRC" == "0" ]] && exit 0; [[ "$initOUT" == *"cluster has already been initialized"* ]] && exit 0; sleep 5; done
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 04 Apr 2020 00:28:04 +0100
Finished: Sat, 04 Apr 2020 00:28:04 +0100
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cz2sn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-cz2sn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cz2sn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/my-release-cockroachdb-init-tfcks to axon
Normal Pulled 5m9s (x5 over 6m45s) kubelet, axon Container image "cockroachdb/cockroach:v19.2.4" already present on machine
Normal Created 5m8s (x5 over 6m45s) kubelet, axon Created container cluster-init
Normal Started 5m8s (x5 over 6m44s) kubelet, axon Started container cluster-init
Warning BackOff 92s (x26 over 6m42s) kubelet, axon Back-off restarting failed container
A 2-node CockroachDB cluster is an anti-pattern. You need 3 or more nodes to avoid data unavailability, or cluster-wide unavailability, when a single node fails. Consider checking out these videos explaining how data in CockroachDB is organized, and then how the nodes in a cluster work together to keep data available in the face of node failure.
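If you can add a third node to the Kubernetes cluster, the same install with 3 replicas would be (same release name and chart as in your question):

helm install my-release --set statefulset.replicas=3 stable/cockroachdb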
When pods crash, the most important troubleshooting inputs are their descriptions (kubectl describe) and their logs.
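For example, using the pod names from your output (--previous shows the logs of the last crashed attempt):

kubectl describe pod my-release-cockroachdb-init-tfcks
kubectl logs my-release-cockroachdb-init-tfcks --previous
kubectl logs my-release-cockroachdb-1 --previous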
The logs of the failing pods show that the architecture of the cockroach image doesn't match the architecture of the node it runs on.
Run kubectl get po -o wide to find the nodes where cockroach is running, then check their architecture.
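Something like this (the -L flag prints the node's kubernetes.io/arch label as an extra column; the docker inspect line assumes Docker is the container runtime, as your describe output suggests):

kubectl get po -o wide
kubectl get nodes -L kubernetes.io/arch
docker image inspect cockroachdb/cockroach:v19.2.4 --format '{{.Architecture}}'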
Only with 3 (or more) nodes are you not at risk of losing data if any single node gets corrupted. Beyond that, explaining how to do it right is easier than working out what went wrong, and to work out what went wrong you have to go through the logs.
If you attach the logs, I can take a look.
I've also written a detailed guide that may address the "doing it right" part of my answer. I elaborated even more about the entire process here.