RabbitMQ - Error while waiting for Mnesia tables

I have installed RabbitMQ on a Kubernetes cluster using a Helm chart. The RabbitMQ pod keeps restarting, and when I check the pod logs I get the following error:

2020-02-26 04:42:31.582 [warning] <0.314.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-02-26 04:42:31.582 [info] <0.314.0> Waiting for Mnesia tables for 30000 ms, 6 retries left

When I run kubectl describe pod, I get this:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-rabbitmq-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-config
    Optional:  false
  healthchecks:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-healthchecks
    Optional:  false
  rabbitmq-token-w74kb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rabbitmq-token-w74kb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/arch=amd64
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                      From                                               Message
  ----     ------     ----                     ----                                               -------
  Warning  Unhealthy  3m27s (x878 over 7h21m)  kubelet, gke-analytics-default-pool-918f5943-w0t0  Readiness probe failed: Timeout: 70 seconds ...
Checking health of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Status of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"", :_, :_}, [], [:""]}]]}}
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"", :_, :_}, [], [:""]}]]}}

I have set all of this up on a Kubernetes cluster on Google Cloud. I am not sure under what specific circumstances it started failing. I had to restart the pod, and it has been failing ever since.

What is going wrong here?

I just deleted the existing Persistent Volume Claim and reinstalled RabbitMQ, and it started working.

So now, every time I install RabbitMQ on a Kubernetes cluster and then scale the pods down to 0, I get the same error when I scale the pods back up later. I also tried deleting the Persistent Volume Claim without uninstalling the RabbitMQ Helm chart, but I still get the same error.

It seems that every time I scale the cluster down to 0, I need to uninstall the RabbitMQ Helm chart, delete the corresponding Persistent Volume Claims, and install the RabbitMQ Helm chart again for it to work.
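
For reference, that full reset looks roughly like this (a sketch using Helm 3 syntax; the PVC name data-rabbitmq-0 comes from the describe output above, and <chart> stands for whichever chart was originally used):

# Remove the release, drop the stale Mnesia data, then reinstall.
helm uninstall rabbitmq
kubectl delete pvc data-rabbitmq-0
helm install rabbitmq <chart>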

I ran into a similar error, shown below:

2020-06-05 03:45:37.153 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-05 03:46:07.154 [warning] <0.234.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-05 03:46:07.154 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 8 retries left

In my case, a slave node (server) of the RabbitMQ cluster was down. Once I started the slave node up again, the master node started without errors.
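
A quick way to see which members of the cluster are down is rabbitmqctl cluster_status, run from any node that is still up (a sketch, assuming a pod named rabbitmq-0):

# Compare the "Disk Nodes" and "Running Nodes" sections to spot missing peers.
kubectl exec -it rabbitmq-0 -- rabbitmqctl cluster_status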

Test this deployment:

kind: Service
apiVersion: v1
metadata:
  namespace: rabbitmq-namespace
  name: rabbitmq-lb
  labels:
    app: rabbitmq
    type: LoadBalancer  
spec:
  type: NodePort
  ports:
   - name: http
     protocol: TCP
     port: 15672
     targetPort: 15672
     nodePort: 31672
   - name: amqp
     protocol: TCP
     port: 5672
     targetPort: 5672
     nodePort: 30672
   - name: stomp
     protocol: TCP
     port: 61613
     targetPort: 61613
  selector:
    app: rabbitmq
---
kind: Service 
apiVersion: v1
metadata:
  namespace: rabbitmq-namespace
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  # Headless service to give the StatefulSet a DNS which is known in the cluster (hostname-#.app.namespace.svc.cluster.local, )
  # in our case - rabbitmq-#.rabbitmq.rabbitmq-namespace.svc.cluster.local  
  clusterIP: None
  ports:
   - name: http
     protocol: TCP
     port: 15672
     targetPort: 15672
   - name: amqp
     protocol: TCP
     port: 5672
     targetPort: 5672
   - name: stomp
     port: 61613
  selector:
    app: rabbitmq
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
  namespace: rabbitmq-namespace
data:
  enabled_plugins: |
      [rabbitmq_management,rabbitmq_peer_discovery_k8s,rabbitmq_stomp].

  rabbitmq.conf: |
      ## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
      ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
      ## Set to "hostname" to use pod hostnames.
      ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
      ## environment variable.
      cluster_formation.k8s.address_type = hostname   
      ## Important - this is the suffix of the hostname, as each node gets "rabbitmq-#", we need to tell what's the suffix
      ## it will give each new node that enters the way to contact the other peer node and join the cluster (if using hostname)
      cluster_formation.k8s.hostname_suffix = .rabbitmq.rabbitmq-namespace.svc.cluster.local
      ## How often should node cleanup checks run?
      cluster_formation.node_cleanup.interval = 30
      ## Set to false if automatic removal of unknown/absent nodes
      ## is desired. This can be dangerous, see
      ##  * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
      ##  * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
      queue_master_locator=min-masters
      ## See http://www.rabbitmq.com/access-control.html#loopback-users
      loopback_users.guest = false
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: rabbitmq-namespace
spec:
  serviceName: rabbitmq
  replicas: 3
  selector:
    matchLabels:
      name: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
        name: rabbitmq
        state: rabbitmq
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:        
      - name: rabbitmq-k8s
        image: rabbitmq:3.8.3
        volumeMounts:
          - name: config-volume
            mountPath: /etc/rabbitmq
          - name: data
            mountPath: /var/lib/rabbitmq/mnesia
        ports:
          - name: http
            protocol: TCP
            containerPort: 15672
          - name: amqp
            protocol: TCP
            containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 10
        resources:
            requests:
              memory: "0"
              cpu: "0"
            limits:
              memory: "2048Mi"
              cpu: "1000m"
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        imagePullPolicy: Always
        env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: RABBITMQ_USE_LONGNAME
            value: "true"
          # See a note on cluster_formation.k8s.address_type in the config file section
          - name: RABBITMQ_NODENAME
            value: "rabbit@$(HOSTNAME).rabbitmq.$(NAMESPACE).svc.cluster.local"
          - name: K8S_SERVICE_NAME
            value: "rabbitmq"
          - name: RABBITMQ_ERLANG_COOKIE
            value: "mycookie"      
      volumes:
        - name: config-volume
          configMap:
            name: rabbitmq-config
            items:
            - key: rabbitmq.conf
              path: rabbitmq.conf
            - key: enabled_plugins
              path: enabled_plugins
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
        - "ReadWriteOnce"
      storageClassName: "default"
      resources:
        requests:
          storage: 3Gi

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq 
  namespace: rabbitmq-namespace 
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: rabbitmq-namespace 
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: rabbitmq-namespace
subjects:
- kind: ServiceAccount
  name: rabbitmq
  namespace: rabbitmq-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: endpoint-reader
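
Assuming the manifests above are saved in one file, say rabbitmq.yaml (a hypothetical name), they can be applied like this:

kubectl create namespace rabbitmq-namespace
kubectl apply -f rabbitmq.yaml
# Watch the pods come up one by one (rabbitmq-0, then rabbitmq-1, then rabbitmq-2).
kubectl get pods -n rabbitmq-namespace -w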

TLDR

helm upgrade rabbitmq bitnami/rabbitmq --set clustering.forceBoot=true

The problem

The problem happens for the following reasons:

  • All RMQ pods are terminated at the same time for some reason (maybe because you explicitly set the StatefulSet replicas to 0, or something else).
  • One of them is the last one to stop (possibly just a tiny bit after the others). It stores this condition ("I'm standalone now") on its filesystem, which in k8s is the PersistentVolume(Claim). Let's say this pod is rabbitmq-1.
  • When you spin the StatefulSet back up, the pod rabbitmq-0 is always the first one to start (see here).
  • During startup, pod rabbitmq-0 first checks whether it is supposed to run standalone. But as far as it can see on its own filesystem, it is part of a cluster. So it looks for its peers and doesn't find any. This results in a startup failure by default.
  • rabbitmq-0 therefore never becomes ready.
  • rabbitmq-1 never starts, because that's how StatefulSets are deployed - one after another. If it were allowed to start, it would start successfully, since it would see that it can run standalone as well.

So in the end, it's a bit of a mismatch between how RabbitMQ and StatefulSets work. RMQ says: "if everything goes down, just start everything up at the same time; one of them will be able to boot, and once that one is up the others can rejoin the cluster." k8s StatefulSets say: "starting everything at once is not possible; we'll start with 0."
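
The one-at-a-time startup described above is the StatefulSet default behaviour (podManagementPolicy: OrderedReady). You can confirm what your own StatefulSet uses with something like this (a sketch, assuming a StatefulSet named rabbitmq):

# Empty output or "OrderedReady" means pods are created strictly one after another,
# each waiting for the previous one to become Ready.
kubectl get statefulset rabbitmq -o jsonpath='{.spec.podManagementPolicy}'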

Solution

To solve this, there is a force_boot command for rabbitmqctl which basically tells an instance to start standalone if it doesn't find any peers. How you can use this from Kubernetes depends on the Helm chart and container you're using. In the Bitnami chart, which uses the Bitnami Docker image, there is a value clustering.forceBoot = true, which translates to the environment variable RABBITMQ_FORCE_BOOT = yes in the container, which in turn issues the above command for you.
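
With the Bitnami chart, that could look roughly like this (a sketch, assuming the release is named rabbitmq and the bitnami repo is already added; --reuse-values keeps the values from the original install instead of resetting them to chart defaults):

# e.g. helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade rabbitmq bitnami/rabbitmq \
  --reuse-values \
  --set clustering.forceBoot=true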

But looking at the problem, you can also see why deleting the PVCs works: the pods all "forget" that they were part of an RMQ cluster last time and happily start up. I would prefer the solution above, though, since no data is lost that way.

In my case, the solution was simple:

Step 1: Scale down the StatefulSet; this will not delete the PVC.

kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=1

Step 2: Access the RabbitMQ pod:

kubectl exec -it rabbitmq-1-rabbitmq-0 -n teps-rabbitmq -- bash

Step 3: Reset the cluster:

rabbitmqctl stop_app
rabbitmqctl force_boot

Step 4: Scale the StatefulSet back up:

  kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=4
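
After scaling back up, the nodes should rejoin one by one; you can watch them come back with (same namespace as above):

kubectl get pods --namespace teps-rabbitmq -w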

If you are in the same scenario as me and you don't know who deployed the Helm chart or how it was deployed... you can edit the StatefulSet directly to avoid messing with anything else.

I was able to make it work without deleting the Helm chart:

kubectl -n rabbitmq edit statefulsets.apps rabbitmq

Under the spec section of the pod template, I added the RABBITMQ_FORCE_BOOT = yes environment variable like this:

    spec:
      containers:
      - env:
        - name: RABBITMQ_FORCE_BOOT # New Line 1 Added
          value: "yes"              # New Line 2 Added

This should also fix the problem... but please try the proper way first, as Ulli described above.
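
Editing the pod template triggers a rolling restart of the StatefulSet; its progress can be checked with (a sketch, using the namespace and name from the edit command above):

kubectl -n rabbitmq rollout status statefulset/rabbitmq

If a pod stays stuck, deleting it with kubectl delete pod forces it to be recreated from the updated template.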