K8ssandra deployment "Error connecting to Node (endPoint=/tmp/cassandra.sock)"

I'm trying to run K8ssandra, but the Cassandra container keeps failing with the following message (repeated over and over):

WARN  [epollEventLoopGroup-374-2] 2021-12-30 23:54:23,711 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x7cf79bf5]'
WARN  [epollEventLoopGroup-374-2] 2021-12-30 23:54:23,712 Loggers.java:39 - [s369] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=7ec5e39e), trying next node (FileNotFoundException: null)
INFO  [nioEventLoopGroup-2-1] 2021-12-30 23:54:23,713 Cli.java:617 - address=/100.97.28.180:53816 url=/api/v0/metadata/endpoints status=500 Internal Server Error

And from the server-system-logger container:

tail: cannot open '/var/log/cassandra/system.log' for reading: No such file or directory

And finally, in the cass-operator pod:

2021-12-30T23:56:22.580Z    INFO    controllers.CassandraDatacenter incorrect status code when calling Node Management Endpoint {"cassandradatacenter": "default/dc1", "requestNamespace": "default", "requestName": "dc1", "loopID": "d1f81abc-6b68-4e63-9e95-1c2b5f6d4e9d", "namespace": "default", "datacenterName": "dc1", "clusterName": "mydomaincom", "statusCode": 500, "pod": "100.122.58.236"}
2021-12-30T23:56:22.580Z    ERROR   controllers.CassandraDatacenter Could not get endpoints data    {"cassandradatacenter": "default/dc1", "requestNamespace": "default", "requestName": "dc1", "loopID": "d1f81abc-6b68-4e63-9e95-1c2b5f6d4e9d", "namespace": "default", "datacenterName": "dc1", "clusterName": "mydomaincom", "error": "incorrect status code of 500 when calling endpoint"}

Not really sure what's going on here. It works fine with the same config on a local minikube cluster, but I can't seem to get it working on my AWS cluster (running Kubernetes v1.20.10).

All the other pods are running fine:

NAME                                                    READY   STATUS    RESTARTS   AGE
mydomaincom-dc1-rac1-sts-0                              2/3     Running   0          17m
k8ssandra-cass-operator-8675f58b89-qt2dx                1/1     Running   0          29m
k8ssandra-medusa-operator-589995d979-rnjhr              1/1     Running   0          29m
k8ssandra-reaper-operator-5d9d5d975d-c6nhv              1/1     Running   0          29m

The pod events show:

Warning  Unhealthy               109s (x88 over 16m)  kubelet                  Readiness probe failed: HTTP probe failed with statuscode: 500
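
In case it helps with diagnosis, this is how I've been inspecting the failing pod (the container name cassandra follows the cass-operator defaults, and the socket path comes from the logs above; this assumes the image ships a shell):

# Get probe configuration and recent events for the failing pod
kubectl describe pod -n default mydomaincom-dc1-rac1-sts-0

# Check whether Cassandra ever created its Unix socket; the management API
# logs FileNotFoundException for /tmp/cassandra.sock when it hasn't
kubectl exec -n default mydomaincom-dc1-rac1-sts-0 -c cassandra -- ls -l /tmp/cassandra.sock

# Look at the cassandra container's stdout for startup errors
kubectl logs -n default mydomaincom-dc1-rac1-sts-0 -c cassandra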

My values.yaml (deployed with Helm 3):

cassandra:
  enabled: true
  version: "4.0.1"
  versionImageMap:
    3.11.7: k8ssandra/cass-management-api:3.11.7-v0.1.33
    3.11.8: k8ssandra/cass-management-api:3.11.8-v0.1.33
    3.11.9: k8ssandra/cass-management-api:3.11.9-v0.1.27
    3.11.10: k8ssandra/cass-management-api:3.11.10-v0.1.27
    3.11.11: k8ssandra/cass-management-api:3.11.11-v0.1.33
    4.0.0: k8ssandra/cass-management-api:4.0.0-v0.1.33
    4.0.1: k8ssandra/cass-management-api:4.0.1-v0.1.33

  clusterName: "mydomain.com"

  auth:
    enabled: true
    superuser:
      secret: ""
      username: ""

  cassandraLibDirVolume:
    storageClass: default
    size: 100Gi

  encryption:
    keystoreSecret:
    keystoreMountPath:
    truststoreSecret:
    truststoreMountPath:

  additionalSeeds: []

  heap: {}
  resources:
    requests:
      memory: 4Gi
      cpu: 500m
    limits:
      memory: 4Gi
      cpu: 1000m

  datacenters:
    - name: dc1
      size: 1
      racks:
        - name: rac1
      heap: {}

  ingress:
    enabled: false

stargate:
  enabled: false
  
reaper:
  autoschedule: true
  enabled: true
  cassandraUser:
    secret: ""
    username: ""
  jmx:
    secret: ""
    username: ""

medusa:
  enabled: true
  image:
    registry: docker.io
    repository: k8ssandra/medusa
    tag: 0.11.3
  cassandraUser:
    secret: ""
    username: ""
  storage_properties: 
    region: us-east-1
  bucketName: my-bucket-name
  storageSecret: medusa-bucket-key

reaper-operator:
  enabled: true

monitoring:
  grafana:
    provision_dashboards: false
  prometheus:
    provision_service_monitors: false
kube-prometheus-stack:
  enabled: false
  prometheusOperator:
    enabled: false
    serviceMonitor:
      selfMonitor: false
  prometheus:
    enabled: false
  grafana:
    enabled: false

I was able to fix this by increasing the memory to 12Gi.
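
For reference, the change amounts to raising the memory request/limit in the cassandra.resources block of values.yaml, e.g. via Helm (the release and chart names here are assumptions; adjust them to your install):

# Raise the Cassandra container memory and roll out the change;
# --set overrides the 4Gi values from values.yaml
helm upgrade k8ssandra k8ssandra/k8ssandra -n default -f values.yaml \
  --set cassandra.resources.requests.memory=12Gi \
  --set cassandra.resources.limits.memory=12Gi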