Can I set task.commit.ms to 1 ms?

I have a project that uses Apache-Samza, but I am running into a problem with duplicate data.

This is my checkpoint configuration:

task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=2
task.commit.ms=20000

In the documentation we can read this:

If task.checkpoint.factory is configured, this property determines how often a checkpoint is written. The value is the time between checkpoints, in milliseconds. The frequency of checkpointing affects failure recovery: if a container fails unexpectedly (e.g. due to crash or machine failure) and is restarted, it resumes processing at the last checkpoint. Any messages processed since the last checkpoint on the failed container are processed again. Checkpointing more frequently reduces the number of messages that may be processed twice, but also uses more resources.
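To get a feel for the trade-off described in that passage, here is a rough back-of-the-envelope sketch. The throughput (5000 messages per second per container) and the task count (8) are made-up numbers, not measurements from my job, and the model assumes that each task writes one checkpoint per commit interval and that a crash replays at most one interval's worth of messages.

```python
def checkpoint_tradeoff(commit_ms, msgs_per_sec, num_tasks):
    """Return (worst-case replayed messages, checkpoint writes per second)."""
    # Messages processed since the last checkpoint are processed again after a crash,
    # so the worst case is one full commit interval of traffic.
    max_replayed = msgs_per_sec * commit_ms / 1000.0
    # Assumption: every task writes one checkpoint per commit interval.
    checkpoint_writes_per_sec = num_tasks * 1000.0 / commit_ms
    return max_replayed, checkpoint_writes_per_sec


if __name__ == "__main__":
    # Hypothetical workload: 5000 msg/s in one container running 8 tasks.
    for commit_ms in (20000, 5000, 250, 1):
        replayed, writes = checkpoint_tradeoff(commit_ms, msgs_per_sec=5000, num_tasks=8)
        print(f"task.commit.ms={commit_ms}: up to {replayed:.0f} replayed messages, "
              f"{writes:.1f} checkpoint writes/s")
```

Under these assumptions, shrinking the interval reduces the possible duplicates linearly while the number of checkpoint writes grows by the same factor, which is the resource cost the documentation warns about.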

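So can I change task.commit.ms=20000 to 250 ms, or even 1 ms? Is that a good or a bad idea? I have a good cluster.

The reason I need to change this is that this Samza job (worker) crashes 1-3 times a week. For now, the temporary workaround is to commit the offset every time.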


Documentation references:

Apache-Samza

Apache-Samza-Configuration

My solution (I know this is not a fix for every problem): I changed task.commit.ms to the same value as task.shutdown.ms=5000.

Atlas-Samza-Configuration Shutdown