Do I need Kafka to have a reliable Storm spout?
As I understand it, ZooKeeper will preserve tuples emitted by bolts, so that if a bolt crashes (or the machine running the bolt crashes, or the entire cluster crashes), the tuples it emitted are not lost. Once everything restarts, the tuples are fetched from ZooKeeper and everything carries on as if nothing bad had ever happened.

What I don't yet understand is whether the same is true for spouts. If a spout emits a tuple (i.e. the emit() function in the spout executes) and the machine the spout is running on crashes shortly afterwards, will that tuple be resurrected from ZooKeeper? Or do we need Kafka to guarantee that?

P.S. I understand that each tuple emitted by a spout must be assigned a unique ID in the call to emit().

P.P.S. I see sample code in books using something like a ConcurrentHashMap<UUID, Values> to keep track of which spouted tuples have not yet been acked. Is this somehow automatically kept in sync with ZooKeeper? If not, then I really shouldn't be doing that, should I? What should I do instead? Use Kafka?
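For reference, here is roughly the pattern I mean, as a minimal sketch of my own (the class name and the data source are made up, and I'm using the Storm 1.x package names; older releases used backtype.storm instead of org.apache.storm):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TrackingSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Tuples that have been emitted but not yet fully acked by the topology.
    private ConcurrentHashMap<UUID, Values> pending;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<>();
    }

    @Override
    public void nextTuple() {
        String sentence = readNextSentence();  // hypothetical data source
        if (sentence == null) {
            return;
        }
        UUID msgId = UUID.randomUUID();
        Values values = new Values(sentence);
        pending.put(msgId, values);
        // The second argument is the unique message ID that Storm's ackers track.
        collector.emit(values, msgId);
    }

    @Override
    public void ack(Object msgId) {
        // The tuple tree was fully processed; forget it.
        pending.remove(msgId);
    }

    @Override
    public void fail(Object msgId) {
        // Something downstream failed (or the tuple timed out); re-emit it ourselves.
        collector.emit(pending.get((UUID) msgId), msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    private String readNextSentence() {
        return null;  // placeholder for a real data source
    }
}

What I can't tell is what happens to that pending map if the machine running the spout dies.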
Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. Here is his answer:
Actually, the tuples aren't persisted into ZooKeeper. If your spout emits a tuple with a unique id, it will automatically be tracked internally by Storm (i.e. by the ackers). Thus, if the emitted tuple fails because of a bolt failure, Storm invokes the method 'fail' on the originating spout task with the unique id as argument. It's then up to you to re-emit the failed tuple.
In sample code, spouts use a Map to track which tuples have been fully processed by your entire topology, so that they can re-emit in case of a bolt failure.
However, if the failure doesn't come from a bolt but from your spout, the in-memory Map will be lost and your topology will not be able to re-emit failed tuples.
For such a scenario you can rely on Kafka. In fact, the Kafka spout stores its read offset in ZooKeeper. That way, if a spout task goes down, it will be able to read its offset from ZooKeeper after restarting.
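To make his last point concrete, here is a minimal sketch of my own (not from his answer) of wiring up the legacy storm-kafka KafkaSpout, whose read offsets live in ZooKeeper. The ZooKeeper address, topic name, offset path and consumer id are placeholders, and older releases used the storm.kafka / backtype.storm package names instead of org.apache.storm.*:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaBackedTopology {
    public static void main(String[] args) throws Exception {
        // Kafka brokers are discovered through ZooKeeper.
        BrokerHosts hosts = new ZkHosts("zookeeper1:2181");

        // The last two arguments are the ZooKeeper path and consumer id under
        // which the spout records its read offsets, so a restarted spout task
        // can resume from where it left off.
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "storm-topic", "/kafka-offsets", "my-consumer");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // ... bolts would be attached here ...

        StormSubmitter.submitTopology("kafka-backed", new Config(), builder.createTopology());
    }
}

Because the offset is committed to ZooKeeper rather than held only in the spout's memory, a restarted spout task resumes reading from its last recorded position instead of depending on an in-memory map surviving the crash.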