ActiveMQ Artemis 消费者在崩溃后不会重新连接到同一个集群节点

Question

我正在 Kubernetes 中建立一个 Artemis 集群，有 3 组 master/slave:

activemq-artemis-master-0                               1/1     Running
activemq-artemis-master-1                               1/1     Running
activemq-artemis-master-2                               1/1     Running
activemq-artemis-slave-0                                0/1     Running
activemq-artemis-slave-1                                0/1     Running
activemq-artemis-slave-2                                0/1     Running

Artemis 版本为 2.17.0。这是我在 master-0 broker.xml 中的集群配置。除了 connector-ref 更改为匹配经纪人外，其他经纪人的配置相同：

<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<configuration xmlns="urn:activemq" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
  <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
    <name>activemq-artemis-master-0</name>
    <persistence-enabled>true</persistence-enabled>
    <!-- this could be ASYNCIO, MAPPED, NIO
           ASYNCIO: Linux Libaio
           MAPPED: mmap files
           NIO: Plain Java Files
       -->
    <journal-type>ASYNCIO</journal-type>
    <paging-directory>data/paging</paging-directory>
    <bindings-directory>data/bindings</bindings-directory>
    <journal-directory>data/journal</journal-directory>
    <large-messages-directory>data/large-messages</large-messages-directory>
    <journal-datasync>true</journal-datasync>
    <journal-min-files>2</journal-min-files>
    <journal-pool-files>10</journal-pool-files>
    <journal-device-block-size>4096</journal-device-block-size>
    <journal-file-size>10M</journal-file-size>
    <!--
       This value was determined through a calculation.
       Your system could perform 1.1 writes per millisecond
       on the current journal configuration.
       That translates as a sync write every 911999 nanoseconds.

       Note: If you specify 0 the system will perform writes directly to the disk.
             We recommend this to be 0 if you are using journalType=MAPPED and journal-datasync=false.
      -->
    <journal-buffer-timeout>100000</journal-buffer-timeout>
    <!--
        When using ASYNCIO, this will determine the writing queue depth for libaio.
       -->
    <journal-max-io>4096</journal-max-io>
    <!--
        You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
         <network-check-NIC>theNicName</network-check-NIC>
        -->
    <!--
        Use this to use an HTTP server to validate the network
         <network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
    <!-- <network-check-period>10000</network-check-period> -->
    <!-- <network-check-timeout>1000</network-check-timeout> -->
    <!-- this is a comma separated list, no spaces, just DNS or IPs
           it should accept IPV6

           Warning: Make sure you understand your network topology as this is meant to validate if your network is valid.
                    Using IPs that could eventually disappear or be partially visible may defeat the purpose.
                    You can use a list of multiple IPs, and if any successful ping will make the server OK to continue running -->
    <!-- <network-check-list>10.0.0.1</network-check-list> -->
    <!-- use this to customize the ping used for ipv4 addresses -->
    <!-- <network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command> -->
    <!-- use this to customize the ping used for ipv6 addresses -->
    <!-- <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command> -->
    <!-- how often we are looking for how many bytes are being used on the disk in ms -->
    <disk-scan-period>5000</disk-scan-period>
    <!-- once the disk hits this limit the system will block, or close the connection in certain protocols
           that won't support flow control. -->
    <max-disk-usage>90</max-disk-usage>
    <!-- should the broker detect dead locks and other issues -->
    <critical-analyzer>true</critical-analyzer>
    <critical-analyzer-timeout>120000</critical-analyzer-timeout>
    <critical-analyzer-check-period>60000</critical-analyzer-check-period>
    <critical-analyzer-policy>HALT</critical-analyzer-policy>
    <page-sync-timeout>2244000</page-sync-timeout>
    <!-- the system will enter into page mode once you hit this limit.
           This is an estimate in bytes of how much the messages are using in memory

            The system will use half of the available memory (-Xmx) by default for the global-max-size.
            You may specify a different value here if you need to customize it to your needs.

            <global-max-size>100Mb</global-max-size>

      -->
    <acceptors>
      <!-- useEpoll means: it will use Netty epoll if you are on a system (Linux) that supports it -->
      <!-- amqpCredits: The number of credits sent to AMQP producers -->
      <!-- amqpLowCredits: The server will send the # credits specified at amqpCredits at this low mark -->
      <!-- amqpDuplicateDetection: If you are not using duplicate detection, set this to false
                                      as duplicate detection requires applicationProperties to be parsed on the server. -->
      <!-- amqpMinLargeMessageSize: Determines how many bytes are considered large, so we start using files to hold their data.
                                       default: 102400, -1 would mean to disable large mesasge control -->
      <!-- Note: If an acceptor needs to be compatible with HornetQ and/or Artemis 1.x clients add
                    "anycastPrefix=jms.queue.;multicastPrefix=jms.topic." to the acceptor url.
                    See https://issues.apache.org/jira/browse/ARTEMIS-1644 for more information. -->
      <!-- Acceptor for every supported protocol -->
      <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
      <!-- AMQP Acceptor.  Listens on default AMQP port for AMQP traffic.-->
      <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
      <!-- STOMP Acceptor. -->
      <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>
      <!-- HornetQ Compatibility Acceptor.  Enables HornetQ Core and STOMP for legacy HornetQ clients. -->
      <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
      <!-- MQTT Acceptor -->
      <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
    </acceptors>
    <security-settings>
      <security-setting match="#">
        <permission type="createNonDurableQueue" roles="amq"/>
        <permission type="deleteNonDurableQueue" roles="amq"/>
        <permission type="createDurableQueue" roles="amq"/>
        <permission type="deleteDurableQueue" roles="amq"/>
        <permission type="createAddress" roles="amq"/>
        <permission type="deleteAddress" roles="amq"/>
        <permission type="consume" roles="amq"/>
        <permission type="browse" roles="amq"/>
        <permission type="send" roles="amq"/>
        <!-- we need this otherwise ./artemis data imp wouldn't work -->
        <permission type="manage" roles="amq"/>
      </security-setting>
    </security-settings>
    <address-settings>
      <!-- if you define auto-create on certain queues, management has to be auto-create -->
      <address-setting match="activemq.management#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redelivery-delay>0</redelivery-delay>
        <!-- with -1 only the global-max-size is in use for limiting -->
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
      <!--default for catch all-->
      <address-setting match="#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redelivery-delay>0</redelivery-delay>
        <!-- with -1 only the global-max-size is in use for limiting -->
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
    </address-settings>
    <addresses>
      <address name="DLQ">
        <anycast>
          <queue name="DLQ"/>
        </anycast>
      </address>
      <address name="ExpiryQueue">
        <anycast>
          <queue name="ExpiryQueue"/>
        </anycast>
      </address>
    </addresses>
    <!-- Uncomment the following if you want to use the Standard LoggingActiveMQServerPlugin pluging to log in events
      <broker-plugins>
         <broker-plugin class-name="org.apache.activemq.artemis.core.server.plugin.impl.LoggingActiveMQServerPlugin">
            <property key="LOG_ALL_EVENTS" value="true"/>
            <property key="LOG_CONNECTION_EVENTS" value="true"/>
            <property key="LOG_SESSION_EVENTS" value="true"/>
            <property key="LOG_CONSUMER_EVENTS" value="true"/>
            <property key="LOG_DELIVERING_EVENTS" value="true"/>
            <property key="LOG_SENDING_EVENTS" value="true"/>
            <property key="LOG_INTERNAL_EVENTS" value="true"/>
         </broker-plugin>
      </broker-plugins>
      -->
    <cluster-user>clusterUser</cluster-user>
    <cluster-password>aShortclusterPassword</cluster-password>
    <connectors>
      <connector name="activemq-artemis-master-0">tcp://activemq-artemis-master-0.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-0">tcp://activemq-artemis-slave-0.activemq-artemis-slave.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-master-1">tcp://activemq-artemis-master-1.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-1">tcp://activemq-artemis-slave-1.activemq-artemis-slave.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-master-2">tcp://activemq-artemis-master-2.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-2">tcp://activemq-artemis-slave-2.activemq-artemis-slave.svc.cluster.local:61616</connector>
    </connectors>
    <cluster-connections>
      <cluster-connection name="activemq-artemis">
        <connector-ref>activemq-artemis-master-0</connector-ref>
        <retry-interval>500</retry-interval>
        <retry-interval-multiplier>1.1</retry-interval-multiplier>
        <max-retry-interval>5000</max-retry-interval>
        <initial-connect-attempts>-1</initial-connect-attempts>
        <reconnect-attempts>-1</reconnect-attempts>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <max-hops>1</max-hops>
        <!-- scale-down>true</scale-down -->
        <static-connectors>
          <connector-ref>activemq-artemis-master-0</connector-ref>
          <connector-ref>activemq-artemis-slave-0</connector-ref>
          <connector-ref>activemq-artemis-master-1</connector-ref>
          <connector-ref>activemq-artemis-slave-1</connector-ref>
          <connector-ref>activemq-artemis-master-2</connector-ref>
          <connector-ref>activemq-artemis-slave-2</connector-ref>
        </static-connectors>
      </cluster-connection>
    </cluster-connections>
    <ha-policy>
      <replication>
        <master>
          <group-name>activemq-artemis-0</group-name>
          <quorum-vote-wait>12</quorum-vote-wait>
          <vote-on-replication-failure>true</vote-on-replication-failure>
          <!--we need this for auto failback-->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>
  </core>
  <core xmlns="urn:activemq:core">
    <jmx-management-enabled>true</jmx-management-enabled>
  </core>
</configuration>

我的消费者在 Spring 启动应用程序中定义为 JmsListener。在使用队列中的消息期间，Spring Boot 应用程序崩溃，导致 kubernetes 删除 pod 并重新创建一个新的 pod。但是，我注意到新的 pod 没有连接到同一个 Artemis 节点，因此之前连接的剩余消息从未被消费过。

我认为使用集群的全部意义在于让所有 Artemis 节点作为一个单元将消息传递给消费者，而不管它连接到哪个节点。我错了吗？如果集群无法将消费者连接重新路由到正确的节点（该节点保存来自先前消费者的剩余消息），那么处理这种情况的推荐方法是什么？

Answer 1

首先，重要的是要注意，在客户端 crashes/restarts 之后，没有使客户端重新连接到它断开连接的代理的功能。一般来说，客户端不应该真正关心它连接到哪个代理；这是水平可扩展性的主要目标之一。

还值得注意的是，如果代理上的消息数量和连接的客户端数量足够少，以至于这种情况经常出现，这几乎肯定意味着您的集群中有太多代理。

就是说，我认为您的客户端没有收到预期消息的原因是因为您使用的是默认值 redistribution-delay（即 -1），这意味着消息将不重新分配给集群中的其他节点。如果你想启用重新分配（这似乎是你做的）那么你应该将它设置为 >= 0，例如：

      <address-setting match="#">
        ...
        <redistribution-delay>0</redistribution-delay>
        ...
      </address-setting>

您可以在 the documentation 中阅读有关重新分配的更多信息。

除此之外，您可能需要总体上重新考虑您的拓扑结构。通常，如果您处于类似云的环境中（例如使用 Kubernetes 的环境），基础架构本身将重启失败 pods，那么您不会使用 master/slave 配置。您只需将日志挂载到单独的 pod 上（例如使用 NFSv4），这样当节点出现故障时它会重新启动，然后重新连接回其持久存储。这有效地提供了代理高可用性（这是 master/slave 为云环境之外的环境设计的）。

此外，根据用例，ActiveMQ Artemis 的单个实例每秒可以处理数百万条消息，因此您实际上可能不需要 3 个活动节点来满足预期负载。

请注意，这些是关于您的整体架构的一般建议，与您的问题没有直接关系。

ActiveMQ Artemis 消费者在崩溃后不会重新连接到同一个集群节点

ActiveMQ Artemis consumer does not reconnect to the same cluster node after crash

spring-jms

kubernetes

activemq-artemis

artemiscloud