Storm 拓扑处理逐渐变慢

Question

我一直在阅读有关 apache 的文章 Storm 尝试了 storm-starter 中的几个示例。还了解了如何调整拓扑以及如何扩展它以足够快地执行以满足所需的吞吐量。

我已经创建了启用确认的示例拓扑，我能够实现每秒 3K-5K 消息处理。它在最初的 10 到 15 分钟或大约 100 万到 200 万条消息中执行得非常快，然后开始变慢。在风暴 UI 上，我可以看到整体延迟开始逐渐上升并且不会恢复，一段时间后处理下降到只有几百秒。对于我尝试过的所有类型，我得到了完全相同的行为，最简单的一种是使用 KafkaSpout 从 kafka 读取并将其发送到 transform bolt parse msg，然后使用 KafkaBolt 再次将其发送到 kafka。解析器非常快，因为解析消息所需的时间不到一毫秒。我尝试了 increasing/describing 并行性、更改缓冲区大小等几个选项，但行为相同。请帮我找出拓扑逐渐变慢的原因。这是我正在使用的配置

1 Nimbus machine (4 CPU) 24GB RAM
2 Supervisor machines (8CPU) and using 1 thread per core with 24GB RAM
4 Node kafka cluster running on above 2 supervisor machines (each topic has 4 partitions)

KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)

topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250

一开始

几分钟后

45 分钟后 - 延迟开始上升

80 分钟后 - 延迟将继续上升，并在达到 8 到 1000 万条消息时持续到 100 秒

可视化虚拟机截图

线程

Answer 1

注意RT_LEFT_BOLT上的capacity指标，非常接近1；这解释了为什么您的拓扑变慢了。

来自Storm documentation：

The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.

因此，您的解决方案是向给定的螺栓 (RT_LEFT_BOLT) 添加更多执行程序（和任务）。您可以做的另一件事是减少 RT_RIGHT_BOLT 上的执行程序数量，容量表明您不需要那么多执行程序，可能 1 或 2 个就可以完成这项工作。

Answer 2

问题是由于 GC 设置了 newgen 参数，它没有完全使用分配的堆，所以内部风暴队列变满并且运行内存不足。奇怪的是，storm 并没有抛出内存不足错误，它只是停滞了，在 visual vm 的帮助下我能够追踪到它。

Storm 拓扑处理逐渐变慢

Storm topology processing slowing down gradually

apache-storm