Calling more than one Mapper with different InputFormatClass in MapReduce

Question

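I want to write code with three Mappers: two of them will process ".csv" files and the other will process ".xml" files. I have already written an XmlInputFormat for the .xml format, from here.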

Now I want to know what I should pass to

job.setInputFormatClass(...);

and which of the following I should use to provide the file paths:

 TextInputFormat.addInputPath(...)
 TextOutputFormat.setInputPath(...)

or

TextInputFormat.addInputPath(...)
TextOutputFormat.setInputPath(...)

Answer 1

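You should consider writing two mappers, one that handles the .csv files and another for the .xml files. Both mappers must emit key-value pairs of the same type, however, so that a single reducer can process their combined output.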

Here is an example that uses org.apache.hadoop.mapred.lib.MultipleInputs:

MultipleInputs.addInputPath(jobConf,
        new Path(csvFilePath),
        SequenceFileInputFormat.class,
        CSVProcessingMapper.class);
MultipleInputs.addInputPath(jobConf,
        new Path(xmlFilePath),
        XmlInputFormat.class,
        XMLProcessingMapper.class);

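Here CSVProcessingMapper.class and XMLProcessingMapper.class are the mappers that process the CSV and XML input. You can register as many mappers as you have input types, with one addInputPath call per path. Likewise, SequenceFileInputFormat.class and XmlInputFormat.class are the corresponding input format classes. Note that SequenceFileInputFormat reads Hadoop's binary SequenceFile format, so for plain-text CSV files TextInputFormat is normally the class to pass instead. Because every path is registered together with its own input format, there is no single job-wide input format left to set on top of these calls.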
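To illustrate why both mappers need the same output types, here is a minimal plain-Java simulation that does not use Hadoop at all. Everything in it is hypothetical and only for illustration: the class name SharedOutputTypeSketch, the toy "key,count" CSV layout, and the one-line XML records. Two differently shaped inputs go through their own mapper function, both emit (String, Integer) pairs, and one reducer sums them per key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java simulation (no Hadoop). All names and record layouts are assumptions.
public class SharedOutputTypeSketch {

    // Stand-in for the CSV mapper: "key,count" -> (key, count).
    static Map.Entry<String, Integer> csvMap(String line) {
        String[] fields = line.split(",");
        return new SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Stand-in for the XML mapper: <item name="key" count="n"/> -> (key, n).
    static Map.Entry<String, Integer> xmlMap(String record) {
        return new SimpleEntry<>(attr(record, "name"), Integer.parseInt(attr(record, "count")));
    }

    // Naive attribute lookup; it assumes the attribute is present in the toy record.
    static String attr(String record, String name) {
        String marker = name + "=\"";
        int start = record.indexOf(marker) + marker.length();
        return record.substring(start, record.indexOf('"', start));
    }

    // Runs one mapper over one input and appends to the shared intermediate list.
    static void runMapper(List<String> records,
                          Function<String, Map.Entry<String, Integer>> mapper,
                          List<Map.Entry<String, Integer>> out) {
        for (String record : records) {
            out.add(mapper.apply(record));
        }
    }

    // Single reducer: sums the values per key. It can consume the output of
    // both mappers only because they emit the same (String, Integer) pair type.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Wires the two inputs to their mappers, then hands everything to the reducer.
    static Map<String, Integer> run(List<String> csvLines, List<String> xmlRecords) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        runMapper(csvLines, SharedOutputTypeSketch::csvMap, intermediate);
        runMapper(xmlRecords, SharedOutputTypeSketch::xmlMap, intermediate);
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = run(
                Arrays.asList("apple,2", "pear,1"),
                Arrays.asList("<item name=\"apple\" count=\"3\"/>"));
        System.out.println(totals); // prints {apple=5, pear=1}
    }
}
```

In a real job the two mapper functions would be Mapper subclasses registered through MultipleInputs, and the shared pair type would be the map output key and value classes that the reducer is declared against.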

java

xml

hadoop

mapreduce

bigdata