Custom Dynamic Partitions in MapReduce

Question

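I am processing my data with MapReduce, and I need to store the output under date partitions. My sort key is the date string. Suppose I override getPartition in my custom Partitioner class to return the following: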

return (formattedDate.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

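Because this uses a hash and a modulo, different dates can return the same integer. For example, assume numReduceTasks=100.

The date 2018-01-20 might have a hash value of 101, so 101 % 100 = 1.

Another date, 2018-02-20, might have a hash value of 201, so 201 % 100 = 1. We therefore end up with files for multiple dates going to a single date partition, which is not what I want. Any pointers on how to handle this?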
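To make the collision concrete, here is a minimal sketch in plain Java (no Hadoop needed) that applies the same formula to every date of one year and counts how many dates land in each partition. The class and method names are hypothetical, and the year 2018 and 100 reduce tasks are just the values from the example above. With 365 dates and at most 100 partitions, at least one partition must receive four or more dates, whatever the hash values turn out to be.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class PartitionCollisionDemo {
    // Same formula as the getPartition override in the question.
    static int partitionFor(String formattedDate, int numReduceTasks) {
        return (formattedDate.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many distinct dates of the given year land in each partition.
    static Map<Integer, Integer> datesPerPartition(int year, int numReduceTasks) {
        Map<Integer, Integer> counts = new HashMap<>();
        LocalDate day = LocalDate.of(year, 1, 1);
        while (day.getYear() == year) {
            // LocalDate.toString() yields ISO format, e.g. 2018-01-20
            counts.merge(partitionFor(day.toString(), numReduceTasks), 1, Integer::sum);
            day = day.plusDays(1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = datesPerPartition(2018, 100);
        int busiest = counts.values().stream().max(Integer::compare).get();
        System.out.println("partitions used: " + counts.size());
        System.out.println("most dates sharing one partition: " + busiest);
    }
}
```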

Answer 1

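I don't think you should pursue a Partitioner with multiple reducers in this case. Unless you know how many unique dates are in the dataset, you cannot set the number of reducers so that each date gets its own.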

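Instead, use MultipleOutputs to send the output to multiple files (files, not directories). If you need them in different directories, you can add a step to the driver after the MR job that iterates over the output directory and moves each file into a directory based on the pattern at the start of its file name, which in this case is the date value.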
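The post-job move step could look roughly like the sketch below. It is an illustration only: it uses java.nio.file on a local directory as a stand-in for HDFS, and it assumes the files are named with the date first (for example 2018-01-20-r-00000). The class and method names are hypothetical. In a real driver the same logic would go through Hadoop's FileSystem (listStatus, mkdirs, rename) instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MoveByDatePrefix {
    // Assumed naming: files start with the date, e.g. 2018-01-20-r-00000
    private static final Pattern DATE_PREFIX =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2})-.+");

    // Moves each date-prefixed file in outputDir into outputDir/<date>/
    // and returns the number of files moved.
    static int moveIntoDateDirs(Path outputDir) throws IOException {
        // List first, then move, so the directory is not modified while iterating.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        int moved = 0;
        for (Path file : files) {
            Matcher m = DATE_PREFIX.matcher(file.getFileName().toString());
            if (!m.matches()) {
                continue; // leaves files such as _SUCCESS or part-r-00000 in place
            }
            Path dateDir = outputDir.resolve(m.group(1));
            Files.createDirectories(dateDir);
            Files.move(file, dateDir.resolve(file.getFileName()));
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        // Small demo on a temporary directory with made-up file names.
        Path dir = Files.createTempDirectory("mr-output");
        Files.createFile(dir.resolve("2018-01-20-r-00000"));
        Files.createFile(dir.resolve("2018-02-20-r-00000"));
        Files.createFile(dir.resolve("_SUCCESS"));
        System.out.println(moveIntoDateDirs(dir) + " files moved under " + dir);
    }
}
```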

For an example of using MultipleOutputs (MO), see this.

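Another option is to run a normal MapReduce job, store the output in a regular output directory, create a Hive table on top of it, and use dynamic partitioning to send the output to different directories according to your date column.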

Answer 2

MultipleOutputs worked. It also works for the solution that creates directories. The Definitive Guide helped me solve this:

The base path specified in the write() method of MultipleOutputs is interpreted relative to the output directory, and because it may contain file path separator characters (/), it’s possible to create subdirectories of arbitrary depth. For example, the following modification partitions the data by station and year so that each year’s data is contained in a directory named by the station ID (such as 029070-99999/1901/part-r-00000)
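Applied to the date case, a reducer along these lines might work. This is a sketch under assumptions, not the asker's actual code: the class name is hypothetical, it assumes the reducer key is the formatted date and the values are Text records, and it needs the Hadoop MapReduce libraries on the classpath. Because the directory comes from the base path rather than from the partition number, a reducer that receives several dates still writes each one to its own directory.

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer: the key is the formatted date, the values are that date's records.
public class DatePartitionReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text date, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        // A base path of "2018-01-20/part" is resolved relative to the job
        // output directory, giving files such as 2018-01-20/part-r-00000.
        String basePath = date.toString() + "/part";
        for (Text record : records) {
            multipleOutputs.write(NullWritable.get(), record, basePath);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closes every per-date writer opened by write().
        multipleOutputs.close();
    }
}
```

In the driver, setting the output format through LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) should keep the job from also creating empty default part-r-nnnnn files in the top-level output directory.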

hadoop

hive

partitioning

mapreduce

bigdata