Kafka Streams delivery semantics for a simple forwarder

I have a stateless Kafka Streams application that consumes from a topic and publishes to a different queue (Cloud Pub/Sub) inside a forEach. The topology does not end by producing to a new Kafka topic.
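A minimal sketch of such a topology might look like the following (the topic name and the Pub/Sub publisher call are placeholders, not from the original post):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

public class ForwarderTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Consume raw bytes: no deserialization or other transformation is applied.
        builder.stream("input-topic", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
               // Terminal operation: the topology does not produce to a new Kafka topic.
               .foreach((key, value) -> publishToPubSub(value));
        return builder.build();
    }

    // Stand-in for the real Cloud Pub/Sub publisher call.
    static void publishToPubSub(byte[] payload) {
        /* publisher.publish(...) */
    }

    public static void main(String[] args) {
        System.out.println(build().describe());
    }
}
```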

How do I know which delivery semantics I can guarantee? Given that it is just a message forwarder, with no deserialization or any other transformation applied: are there scenarios in which I could end up with duplicate or lost messages?

I am considering the following scenarios and their implications for how offsets are committed:

Thanks, everyone.

If you consider the Kafka-to-Kafka loop that a Kafka Streams application usually creates, setting the property:

processing.guarantee=exactly_once

is enough to get exactly-once semantics, including in failure scenarios.
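As a configuration sketch (application id and bootstrap servers are placeholder values for context):

```properties
application.id=forwarder-app
bootstrap.servers=localhost:9092
# Runs the consume-process-produce-commit cycle inside a Kafka transaction.
processing.guarantee=exactly_once
```

Note that on recent Kafka versions (3.0+), `exactly_once_v2` is the recommended value for this property; plain `exactly_once` is the older setting.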

Under the hood, Kafka uses transactions to guarantee that the consume - process - produce - commit offsets cycle is executed with all-or-nothing semantics.
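The transactional producer API makes that cycle explicit. A sketch of the loop (topic name is a placeholder; running it requires a live broker and a transactional producer/read-committed consumer configuration):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalForwardLoop {
    static void run(KafkaConsumer<String, String> consumer,
                    KafkaProducer<String, String> producer) {
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> rec : records) {
                    producer.send(new ProducerRecord<>("output-topic", rec.key(), rec.value()));
                    offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                new OffsetAndMetadata(rec.offset() + 1));
                }
                // The offset commit rides in the same transaction as the produced records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                // All produced records and the offset commit are discarded together.
                producer.abortTransaction();
            }
        }
    }
}
```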

Writing a sink connector from Kafka to Google Pub/Sub with exactly-once semantics means solving the same issues that Kafka already solved for the Kafka-to-Kafka scenario:

  1. The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
  2. We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
  3. Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
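On the Kafka side, problem 3 is what `transactional.id` fencing addresses: a new producer instance that registers the same `transactional.id` fences out its zombie predecessor. As a producer configuration sketch (the id value is a placeholder; it must be stable per logical producer instance):

```properties
# Addresses problem 1: broker deduplicates internal producer retries.
enable.idempotence=true
# Addresses problem 3: a restarted instance with the same id fences out zombies.
transactional.id=forwarder-instance-0
```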

Assuming your Cloud Pub/Sub producer logic does not suffer from problem 1, just as a Kafka producer with enable.idempotence=true does not, you are still exposed to problems 2 and 3.

Without solving those issues, your processing semantics will be whatever delivery semantics your consumer is using: at-least-once, if you choose to commit offsets manually.
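The at-least-once duplicate case (problem 2 above) is easy to see in a toy simulation, plain Java with no Kafka involved: if the forwarder crashes after publishing but before committing the offset, the restarted instance re-reads the same record and publishes it a second time.

```java
import java.util.ArrayList;
import java.util.List;

public class AtLeastOnceDemo {
    static final List<String> topic = List.of("A", "B"); // the input topic
    static final List<String> published = new ArrayList<>(); // the external queue
    static int committedOffset = 0; // next offset to read on (re)start

    // Forward records from the last committed offset onward; simulate a crash
    // (return) right after publishing the record at crashAtOffset, before committing.
    static void run(int crashAtOffset) {
        for (int offset = committedOffset; offset < topic.size(); offset++) {
            published.add(topic.get(offset));    // publish to the external queue
            if (offset == crashAtOffset) return; // crash before the offset commit
            committedOffset = offset + 1;        // manual offset commit
        }
    }

    public static void main(String[] args) {
        run(0);   // crashes after publishing "A" but before committing its offset
        run(-1);  // restart: re-reads "A" and publishes it again -> duplicate
        System.out.println(published); // [A, A, B]
    }
}
```

Reversing the two steps (commit first, then publish) turns the duplicate into a lost message instead, which is the at-most-once trade-off.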