Difference between combiner and partitioner

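I am new to MapReduce and I cannot figure out the difference between a partitioner and a combiner. I know that both run in the intermediate step between the map and reduce tasks, and that both reduce the amount of data the reduce tasks have to process. Please explain the difference with an example.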

First, I agree with @Binary nerd's comment.

Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper output before it is distributed further. Once the combiner has run, its output is passed on to the reducer for further work.

A partitioner, by contrast, comes into the picture when we work with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It takes the mapper output (or the combiner output, if a combiner is used) and sends each record to the responsible reducer based on its key.
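As a rough illustration of the routing just described, here is a plain-Java sketch that assigns keys to reducers by hashing the key. It does not use the Hadoop API: `PartitionSketch`, `partitionFor`, and `route` are hypothetical names, and the formula is an assumption about what a typical hash partitioner does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop dependency); all names here are hypothetical.
final class PartitionSketch {

    // Hash the key, clear the sign bit, and take the remainder modulo the
    // number of reducers. Assumption: this mirrors typical hash partitioning.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group keys by the index of the reducer they are routed to.
    static Map<Integer, List<String>> route(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partitionFor(key, numReducers);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hello", "there", "hello", "howdy", "again", "howdy");
        // Every occurrence of the same key ends up at the same reducer.
        route(keys, 2).forEach((reducer, ks) ->
                System.out.println("reducer " + reducer + " <- " + ks));
    }
}
```

The point to notice is that the partitioner never changes or shrinks the data. It only decides where each key goes, and equal keys always go to the same reducer.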

Scenario with both a combiner and a partitioner:

Scenario with only a partitioner:

Examples:

  • Combiner Example

  • Partitioner example:

    The partitioning phase takes place after the map phase and before the reduce phase. The number of partitions is equal to the number of reducers. The data gets partitioned across the reducers according to the partitioning function. The difference between a partitioner and a combiner is that the partitioner divides the data according to the number of reducers so that all the data in a single partition gets executed by a single reducer. However, the combiner functions similarly to the reducer and processes the data in each partition. The combiner is an optimization to the reducer. The default partitioning function is the hash partitioning function where the hashing is done on the key. However, it might be useful to partition the data according to some other function of the key or the value. -- Source
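To make the last sentence of that quote concrete, here is a plain-Java sketch of partitioning by "some other function of the key": the first letter instead of the hash. The class name, the method name, and the a-m / n-z split are all made up for illustration, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch; the names and the a-m / n-z rule are hypothetical.
final class FirstLetterPartitionSketch {

    // Keys starting with a..m go to bucket 0, everything else to bucket 1.
    // The modulo keeps the result valid when there is only one reducer.
    static int partitionFor(String key, int numReducers) {
        if (key.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(key.charAt(0));
        int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
        return bucket % numReducers;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("again", "hello", "howdy", "there");
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 2));
        }
    }
}
```

A rule like this can be handy when each reducer's output should cover a contiguous key range, at the cost of possibly uneven load across reducers.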

I think a small example can explain this clearly and quickly.

Suppose you have a MapReduce word count job with 2 mappers and 1 reducer.

Without a combiner

"hello hello there" => mapper1 => (hello, 1), (hello, 1), (there, 1)

"howdy howdy again" => mapper2 => (howdy, 1), (howdy, 1), (again, 1)

Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

Using the reducer as the combiner

"hello hello there" => mapper1 with combiner => (hello, 2), (there, 1)

"howdy howdy again" => mapper2 with combiner => (howdy, 2), (again, 1)

Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

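Conclusion

The end result is the same, but with a combiner the map output is already reduced before it leaves the mapper. In this example, each mapper sends only 2 output pairs to the reducer instead of 3, so you gain I/O and disk performance. This is useful when aggregating values.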
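If you want to check those numbers yourself, here is a small plain-Java simulation of the word count above. It is only a sketch and does not use Hadoop: `CombinerSketch`, `map`, `sumByKey`, and `run` are hypothetical names, and the "shuffle" is just a list that collects whatever the mappers send.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word count example; all names are hypothetical.
final class CombinerSketch {

    // Map step: emit one (word, 1) pair per word in the input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Sum the values per key. The same function serves as combiner and reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Run the job with one mapper per split and a single reducer.
    // pairsSent[0] receives the number of pairs sent from mappers to the reducer.
    static Map<String, Integer> run(List<String> splits, boolean useCombiner, int[] pairsSent) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                // Local reduce on this mapper's output only.
                mapOutput = new ArrayList<>(sumByKey(mapOutput).entrySet());
            }
            shuffled.addAll(mapOutput);
        }
        pairsSent[0] = shuffled.size();
        return sumByKey(shuffled);
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hello hello there", "howdy howdy again");
        int[] sent = new int[1];
        System.out.println(run(splits, false, sent) + " pairs sent: " + sent[0]);
        System.out.println(run(splits, true, sent) + " pairs sent: " + sent[0]);
    }
}
```

Both runs produce the same counts. Only the number of pairs crossing from the mappers to the reducer changes: 6 without the combiner, 4 with it.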

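A combiner is effectively a reducer applied to the map() output.

If you look at the first Apache MapReduce tutorial, which happens to be exactly the MapReduce example I just illustrated, you can see that they use the reducer as the combiner: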

job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);