Does the number of Kafka partitions increase the speed of Spark writing to Kafka?

When reading, Spark maps 1:1 to Kafka partitions, so with more partitions we can exploit more parallelism in our job.

But does the same apply when Spark writes to Kafka? Is writing the same dataset to a topic with 4 partitions faster than writing it to a topic with 1 partition?

Yes.

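If your topic has 1 partition, it lives on a single broker, so if you raise the producer rate for that topic, that one broker gets busy. If you have multiple partitions, your Kafka cluster spreads those partitions across different brokers, and the produce rate is shared among them. So writing the same dataset to a topic with 4 partitions is generally faster than writing it to a topic with 1 partition.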
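For context, here is a minimal sketch of a batch write from Spark to Kafka. The function names, the broker address and the topic are placeholders I made up; only the `kafka` format and the `kafka.bootstrap.servers` and `topic` options come from Spark's Kafka integration (the `spark-sql-kafka-0-10` package must be on the classpath). Note that the number of concurrent write tasks on the Spark side comes from the DataFrame's partition count, while the topic's partition count decides how that traffic is spread over brokers.

```python
def kafka_sink_options(bootstrap_servers, topic):
    """Options understood by Spark's built-in Kafka sink."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # e.g. "localhost:9092" (placeholder)
        "topic": topic,                                # e.g. "events" (placeholder)
    }


def write_to_kafka(df, bootstrap_servers, topic, num_tasks):
    """Batch-write a DataFrame that has `key` and `value` columns to Kafka.

    `df` is assumed to be a pyspark.sql.DataFrame. `num_tasks` is a
    hypothetical knob: it sets how many Spark tasks write in parallel.
    """
    (
        df.repartition(num_tasks)  # one write task per Spark partition
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
        .write.format("kafka")
        .options(**kafka_sink_options(bootstrap_servers, topic))
        .save()
    )
```

Calling `write_to_kafka(df, "localhost:9092", "events", 4)` would run 4 write tasks; whether the brokers keep up with them is where the topic's partition count matters.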

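It is not only about the produce rate. A Kafka broker also runs several background processes, such as compression, compaction and segment handling, so the workload grows as the number of messages increases. With multiple partitions on multiple brokers, that work is distributed as well.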
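To make the distribution argument concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not Kafka's real behaviour: `zlib.crc32` stands in for the producer's key hash (Kafka's default partitioner uses murmur2), and partition leaders are placed round-robin on brokers. All function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (assumption): Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions


def leader_broker(partition: int, num_brokers: int) -> int:
    # Simplified round-robin leader placement (assumption).
    return partition % num_brokers


def broker_load(keys, num_partitions, num_brokers):
    """Count how many messages each broker receives as partition leader."""
    load = Counter()
    for key in keys:
        p = partition_for(key, num_partitions)
        load[leader_broker(p, num_brokers)] += 1
    return load


if __name__ == "__main__":
    keys = [f"user-{i}".encode() for i in range(10_000)]
    print(broker_load(keys, num_partitions=1, num_brokers=4))  # everything on broker 0
    print(broker_load(keys, num_partitions=4, num_brokers=4))  # spread over 4 brokers
```

With 1 partition every message lands on the same broker no matter how many brokers exist; with 4 partitions the same messages are split across the 4 leaders.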

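However, you should not use more partitions than you need, because increasing the partition count also increases the number of open files on the brokers and can increase replication latency.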
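A rough back-of-the-envelope for the open-file cost, as a sketch only: it assumes each log segment keeps three files (`.log`, `.index`, `.timeindex`) and that every replica holds the same number of segments. The function name and the example numbers are made up for illustration.

```python
def estimate_open_files(partitions, replication_factor, segments_per_partition,
                        files_per_segment=3):
    """Cluster-wide estimate of segment files kept for one topic.

    files_per_segment=3 assumes .log, .index and .timeindex per segment.
    """
    return partitions * replication_factor * segments_per_partition * files_per_segment


if __name__ == "__main__":
    print(estimate_open_files(1, 3, 10))  # 90
    print(estimate_open_files(4, 3, 10))  # 360
```

The count grows linearly with the number of partitions, which is why it pays to size the topic for the throughput you actually need.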

From the Kafka documentation:

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

apache-kafka

apache-spark

spark-structured-streaming