Kafka Mirror Maker place of execution
Some best-practice guides recommend running Mirror Maker on the target cluster.
https://community.hortonworks.com/articles/79891/kafka-mirror-maker-best-practices.html
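I wonder why this recommendation exists, because ultimately all data has to cross the boundary between the clusters, regardless of whether it is consumed at the target or produced at the source. One reason I can imagine is that Mirror Maker supports multiple consumers but only one producer, so using multiple consumers might speed up consumption over the path with the higher latency.
If multi-threaded performance is the point, would it be useful to replicate the data with several producers (one per consumer), using a custom replication process? Does anyone know why Mirror Maker shares a single producer among all consumers?
My use case is replicating data from several source clusters (~10) to a single target cluster. I would prefer to run the replication process on the source clusters, to avoid having many replication processes (one per source) on the target cluster.
Hints and suggestions on this topic are very welcome.
I also asked this question on the Apache Kafka mailing list: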
https://lists.apache.org/thread.html/06a3c3ec10e4c44695ad0536240450919843824fab206ae3f390a7b8@%3Cusers.kafka.apache.org%3E
I would like to quote some reasonable answers here:
Franz, you can run MM on or near either source or target cluster, but
it's more efficient near the target because this minimizes producer
latency. If latency is high, producers will block waiting on ACKs for
in-flight records, which reduces throughput.
I recommend running MM near the target cluster but not necessarily on
the same machines, because often Kafka nodes are relatively expensive,
with SSD arrays and huge IO bandwidth etc, which isn't necessary for
MM.
Ryanne
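The latency argument above can be made concrete with a rough back-of-the-envelope estimate. This is only a sketch under simplifying assumptions: one producer connection to one broker, every request carrying exactly one full batch, and throughput limited only by how many requests may be in flight while waiting for ACKs. The values 16384 and 5 are the defaults of the producer settings batch.size and max.in.flight.requests.per.connection in recent Kafka clients; the round-trip times are made-up illustrative numbers, and the function name is hypothetical.

```python
def max_producer_throughput(batch_bytes, max_in_flight, rtt_ms):
    """Rough upper bound, in bytes per second, for one producer connection.

    Simplified model (an assumption, not Mirror Maker's actual behaviour):
    at most `max_in_flight` requests of `batch_bytes` each can be
    unacknowledged at once, and each ACK takes one round trip of `rtt_ms`.
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return batch_bytes * max_in_flight * 1000 / rtt_ms


BATCH_BYTES = 16384    # producer default for batch.size
MAX_IN_FLIGHT = 5      # producer default for max.in.flight.requests.per.connection

# Hypothetical round-trip times from the producer to the target brokers.
near_target = max_producer_throughput(BATCH_BYTES, MAX_IN_FLIGHT, rtt_ms=1)
near_source = max_producer_throughput(BATCH_BYTES, MAX_IN_FLIGHT, rtt_ms=100)

print(f"MM near target: {near_target / 1e6:.2f} MB/s per connection")
print(f"MM near source: {near_source / 1e6:.2f} MB/s per connection")
```

Under this model the bound shrinks in proportion to the round-trip time, which is why a producer far from the target cluster is the bottleneck. Larger batches or more in-flight requests raise the bound, at the cost of memory and, for in-flight requests, possibly ordering guarantees.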
and
Hi, Franz!
I guess, one of the reasons could be additional safety in case of
network split.
There is also some probability of bugs, even with good software. So, if we
place MM on the source cluster and the network splits, consumers could
(theoretically) continue to read messages from the source cluster and
commit them even without ACKs from the destination cluster (one of the
possible bugs). This way you will end up with lost messages on the
producer side after the network is fixed.
On the other hand, if we place MM on the destination cluster and the network
splits, nothing bad happens. MM will be unable to grab data from the
source cluster, so your data won't be corrupted even in case of bugs.
Tolya
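The failure mode described above can be illustrated with a small toy simulation. It is not Mirror Maker code: the function names, the in-memory "clusters" and the set of offsets affected by the split are all hypothetical. It only contrasts a replicator that commits a source offset after the destination has acknowledged the record with a buggy one that commits regardless.

```python
def replicate(source, start, down_offsets, wait_for_ack):
    """Copy source[start:] to the destination; return (delivered, committed).

    `down_offsets` holds the offsets whose send fails because of the split.
    With wait_for_ack=True the loop stops at the first failed send and does
    not commit it; with wait_for_ack=False (the hypothetical bug) it commits
    every offset it has read, whether or not the send succeeded.
    """
    delivered = []
    committed = start
    for offset in range(start, len(source)):
        acked = offset not in down_offsets
        if acked:
            delivered.append(source[offset])
        elif wait_for_ack:
            break  # leave the offset uncommitted so it is retried later
        committed = offset + 1
    return delivered, committed


def run_with_split(wait_for_ack):
    """First pass during a network split, second pass after it is fixed."""
    source = ["a", "b", "c", "d", "e"]
    first, committed = replicate(source, 0, {2, 3}, wait_for_ack)
    second, _ = replicate(source, committed, set(), wait_for_ack)
    return first + second


print("commit after ack:  ", run_with_split(wait_for_ack=True))
print("commit without ack:", run_with_split(wait_for_ack=False))
```

In the buggy variant the records at the offsets hit by the split never reach the destination, because their offsets were already committed on the source side. If the replicator runs next to the destination, a split stops it from reading at all, so this kind of bug has nothing to commit.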