Confluent Cloud 上的 Apache Kafka - 分区主题和消费者滞后中的不一致偏移

Question

我在 Confluent Cloud 上使用 Kafka 时发现了一个奇怪的行为。我创建了一个默认分区值的主题：6.

我的系统包含一个 Java 向该主题发送消息的生产者应用程序和一个从中读取消息并针对每条消息执行操作的 Kafka Streams 应用程序。

-----------------------          --------            -----------
| Kafka Java Producer |  ---->  | topic | ---->      | KStream |
-----------------------          --------            -----------

目前我只启动了一个 Kafka Streams 应用程序实例，因此消费者组只有一个成员。

这是我观察到的：

生产者发送一条消息并记录在事件主题中offset 0:

消息到达 KStream，正在正确处理，正如我在 KStream 日志跟踪中看到的那样：

KStream

events.foreach { key, value ->
    logger.info("--------> Processing TimeMetric {}", value)
    //Store in DB

日志

[-StreamThread-1] uration$$EnhancerBySpringCGLIB$$e72e3f00 : --------> Processing Event {"...

在 Confluent Cloud 消费者滞后中，我可以看到所有消费者组及其状态。 KStream 有一个名为 events-processor-19549050-d8b0-4b39...。如前所述，该组只有一个成员（KStream 的唯一实例）。但是，如果显示该组在分区 2 中的一条消息之后。此外，请注意当前偏移量似乎为 1，结束偏移量为 2):

如果我在生产者中发送另一条消息，它会再次记录在主题中，但这次 偏移量 2 而不是 1:

消息到达KStream再次正常处理：

[-StreamThread-1] uration$$EnhancerBySpringCGLIB$$e72e3f00 : --------> Processing Event {

回到消费者组的消费者滞后，它仍然落后一条消息，仍然有一些奇怪的偏移量（当前 3，结束 4）：

虽然处理的好像没问题，但是上面显示的状态没有多大意义。你能解释一下原因吗:

消息偏移量递增 +2 而不是 +1？
消费者组似乎落后了 1 条消息，即使它正确处理了消息？

Answer 1

对于第一个问题，有两种可能性（尽管通过阅读第二个问题，您似乎正在使用交易）：

如果您不使用exactly-once语义，生产者可能会发送多个消息，因为没有控制在之前发送的内容 上。这样，由于那些重复的消息，Kafka 的默认 at-least-once 语义可能会增加您的偏移量 >+1。
如果您使用exactly-once语义，或交易，交易的每个事件写入标记到主题中，用于内部控制目的。这些标记是 +2 增加的原因，因为它们也存储在主题中（但被消费者避免）。 Confluent's guide to transactions 也有一些关于此行为的信息：

After the producer initiates a commit (or an abort), the coordinator begins the two phase commit protocol.

In the first phase, the coordinator updates its internal state to “prepare_commit” and updates this state in the transaction log. Once this is done the transaction is guaranteed to be committed no matter what.

The coordinator then begins phase 2, where it writes transaction commit markers to the topic-partitions which are part of the transaction.

These transaction markers are not exposed to applications, but are used by consumers in read_committed mode to filter out messages from aborted transactions and to not return messages which are part of open transactions (i.e., those which are in the log but don’t have a transaction marker associated with them).

Once the markers are written, the transaction coordinator marks the transaction as “complete” and the producer can start the next transaction.

一般来说，你不应该关心偏移量，因为它不是 definitive-guide 可以看的。例如，重试、重复或事务标记会使偏移量与您对生产者的预期不同，但您不必担心；您的消费者会，他们只会处理 "real" 条消息。

关于问题 2，这是一个已知问题：https://issues.apache.org/jira/browse/KAFKA-6607

引用 jira:

When an input topic for a Kafka Streams application is written using transaction, Kafka Streams does not commit "endOffset" but "endOffset - 1" if it reaches the end of topic. The reason is the commit marker that is the last "message" in the topic; Streams commit "offset of last processed message plus 1" and does not take commit markers into account.

This is not a correctness issue, but when one inspect the consumer lag via bin/kafka-consumer.group.sh the lag is show as 1 instead of 0 – what is correct from consumer-group tool point of view.

希望对您有所帮助！

Confluent Cloud 上的 Apache Kafka - 分区主题和消费者滞后中的不一致偏移

Apache Kafka on Confluent Cloud - Incoherent offsets in partitioned topic and consumer lag

apache-kafka

spring-cloud-stream

apache-kafka-streams

confluent-cloud

confluent-platform