RabbitMQ cluster crashing when creating queues

Question

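Hi, I have a problem that I have been looking at for about 2 days without solving it, so I'll describe it here as clearly as I can, in the hope that it also helps others.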

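The scenario is:

We have an application that has to handle about 200k devices through a RabbitMQ cluster; the devices speak AMQP.
We have thought about 1 exchange with 200k queues, with around 6 routing keys per device.
The queues need to be durable and lazy, because we don't want to lose any messages.
We are using queue mirroring because we need HA.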

Tests:

1. I created a cluster with 5 nodes and a replication factor of 2:

    "definition": {
        "ha-mode": "exactly",
        "ha-params": 2,
        "ha-sync-mode": "automatic",
        "ha-sync-batch-size": 1
    }
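For reference, a definition like this can be submitted through the management plugin's HTTP API (`PUT /api/policies/{vhost}/{name}`). The sketch below only builds the request and does not send it. The policy name `ha-two`, the pattern `.*`, the host and the credentials are placeholders chosen for illustration, not values from the cluster described here.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_policy_request(base_url, vhost, name, pattern, replicas, user, password):
    """Build (but do not send) a PUT request that defines a mirroring policy."""
    body = {
        "pattern": pattern,      # regex matched against queue names (placeholder)
        "apply-to": "queues",
        "definition": {
            "ha-mode": "exactly",
            "ha-params": replicas,
            "ha-sync-mode": "automatic",
            "ha-sync-batch-size": 1,
        },
    }
    # The vhost is percent-encoded; the default vhost "/" becomes "%2F".
    url = "{}/api/policies/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(name, safe=""),
    )
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
```

Calling `build_policy_request("http://localhost:15672", "/", "ha-two", ".*", 2, "guest", "guest")` returns a request object that could then be passed to `urllib.request.urlopen`.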

2. I also created 50k durable, lazy queues together with their routing keys:

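    # BINDINGS, EXCHANGE and get_channel() are not shown in this post.

    def create_one_queue(queue_name, threadName, channel):
        """Declare one durable, lazy queue and bind it with every routing-key pattern."""
        channel.queue_declare(queue=queue_name, durable=True, arguments={'x-queue-mode': 'lazy'})
        for bind in BINDINGS:
            channel.queue_bind(exchange=EXCHANGE, queue=queue_name, routing_key=bind.format(queue_name))
        print("[{}] Created queue {}".format(threadName, queue_name))


    def create_queues(threadName, base):
        """Create 1000 queues named base .. base + 999 on a single channel."""
        channel = get_channel()
        for i in range(1000):
            queue_name = str(i + base)
            try:
                create_one_queue(queue_name, threadName, channel)
            except Exception as e:
                # Log the error and move on to the next queue.
                print(e)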
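One thing to watch with a loop like this: in AMQP 0-9-1 a channel-level error closes the channel, so after the first failure every later call on the same channel fails too. A minimal sketch of reopening the channel before retrying follows. `open_channel` and `attempts` are hypothetical names introduced for illustration, and the channel is only assumed to expose pika-style `queue_declare` and `queue_bind` methods.

```python
def create_queue_resilient(channel, open_channel, queue_name, exchange, bindings, attempts=3):
    """Declare and bind one durable, lazy queue, reopening the channel on failure.

    open_channel is a hypothetical zero-argument callable that returns a fresh
    channel. The channel that finally succeeded is returned so the caller can
    keep using it for the next queue.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            channel.queue_declare(
                queue=queue_name, durable=True, arguments={"x-queue-mode": "lazy"}
            )
            for bind in bindings:
                channel.queue_bind(
                    exchange=exchange,
                    queue=queue_name,
                    routing_key=bind.format(queue_name),
                )
            return channel
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt + 1 < attempts:
                channel = open_channel()  # the failed channel is likely closed
    raise RuntimeError("could not create queue {}".format(queue_name)) from last_error
```

In the loop above this would be used as `channel = create_queue_resilient(channel, get_channel, queue_name, EXCHANGE, BINDINGS)`.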

3. When I kept growing toward 200k queues, the nodes started crashing even though they had not run out of resources.
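To put rough numbers on these tests, the object counts can be worked out directly. The sketch below assumes queue masters and mirrors are spread evenly across the nodes, which a real cluster does not guarantee; the function name is made up for illustration.

```python
def cluster_load(queues, bindings_per_queue, replicas, nodes):
    """Return total bindings, total queue replicas, and replicas per node."""
    total_bindings = queues * bindings_per_queue
    # With ha-mode "exactly", ha-params counts the master plus its mirrors.
    total_replicas = queues * replicas
    return {
        "bindings": total_bindings,
        "queue_replicas": total_replicas,
        "replicas_per_node": total_replicas / nodes,  # assumes an even spread
    }


for queues in (50_000, 200_000):
    print(queues, cluster_load(queues, 6, 2, 5))
```

For 200k queues on 5 nodes this works out to 1.2 million bindings and about 80k queue replicas per node.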

Links

I have already looked at the following posts:

https://www.rabbitmq.com/ha.html#ways-to-configure

https://www.cloudamqp.com/blog/2018-01-09-part3-rabbitmq-best-practice-for-high-availability.html

RabbitMQ - How many queues RabbitMQ can handle on a single server?

https://serverfault.com/questions/378165/rabbitmq-reasonable-performance-scale-expectations

http://rabbitmq.1065348.n5.nabble.com/How-many-queues-can-one-broker-support-td21539.html

https://www.quora.com/RabbitMQ/Can-rabbitMQ-or-zeroMQ-handle-1mil-queues

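But I see contradictions (CloudAMQP recommends using few queues, while other sources say you can reach 1M queues).

Questions

How is it possible for the cluster to start crashing if I am not running out of resources?
Is my approach wrong?
Any suggestions for improving my cluster configuration?

Thanks a lot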

Answer 1

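OK, I'll answer my own question based on what I have found so far:

1) Because I deployed the cluster with Kubernetes and Helm, I was putting a lot of memory pressure on the pods and leaving no free space for the garbage collector. See https://www.rabbitmq.com/memory-use.html#queue-memory-usage-gc: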

High memory watermark blocks publishers and prevents new messages from being enqueued. Since garbage collection can double the memory used by a queue, it is unsafe to set the high memory watermark above 0.5. The default high memory watermark is set to 0.4 since this is safer as not all memory is used by queues. This is entirely workload specific, which differs across RabbitMQ deployments.
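As a rough illustration of the quoted guidance, the relative watermark can be turned into bytes for a given pod memory limit. `vm_memory_high_watermark.relative` is the `rabbitmq.conf` key for this setting. The 30 GiB limit and the helper name are assumptions made for the sketch.

```python
def watermark_bytes(memory_limit_bytes, relative=0.4):
    """Bytes at which publishers get blocked, for a relative high watermark."""
    if not 0 < relative <= 1:
        raise ValueError("relative watermark must be in (0, 1]")
    return int(memory_limit_bytes * relative)


GIB = 1024 ** 3
limit = 30 * GIB  # assumed pod memory limit
for rel in (0.4, 0.5):
    mark = watermark_bytes(limit, rel)
    print("vm_memory_high_watermark.relative = {}".format(rel))
    print("  publishers blocked at {:.1f} GiB, {:.1f} GiB left as headroom".format(
        mark / GIB, (limit - mark) / GIB))
```

The headroom figure is what remains for garbage collection and everything else in the pod, so sizing the pod limit matters as much as the watermark itself.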

2) It looks OK.

3) To create 200k durable, lazy queues I had to use a cluster of 10 nodes, each with 8 vCPUs and 30 GB of RAM.

Note: I will keep this answer updated as I tune the cluster.
