Hadoop Mapper parameters explanation

Question

I'm new to Hadoop and confused by the Mapper parameters.

Take the well-known WordCount as an example:

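import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  // Reused across map() calls to avoid allocating new objects per record.
  private Text outputKey;
  private IntWritable outputVal;

  @Override
  public void setup(Context context) {
    outputKey = new Text();
    outputVal = new IntWritable(1);
  }

  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // value holds one line of input; key is not used here.
    StringTokenizer stk = new StringTokenizer(value.toString());
    while (stk.hasMoreTokens()) {
      // Emit (word, 1) for every token on the line.
      outputKey.set(stk.nextToken());
      context.write(outputKey, outputVal);
    }
  }
}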

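Look at the map function: its parameters are Object key, Text value, and Context context. I'm confused about what Object key looks like (as you can see, key is never used in the map function).

The input file looks like this: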

Deer
Beer
Bear
Beer
Deer
Deer
Bear
...

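I know that value is each line in turn: Deer, Beer, and so on. The lines are processed one at a time.

But what does key look like? And how do I decide which data type key should have?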

Answer 1

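Everything here depends on which InputFormat class you use. It parses the input data source and hands you (Key, Value) pairs. Even for the same input source, different InputFormat implementations can give you different streams of pairs.

Here is an article that demonstrates the approach: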

https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/

The component doing the real work here is the RecordReader.
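
To make this concrete: with the default TextInputFormat, key is a LongWritable holding the byte offset at which the line starts in the file, and value is the line itself, so the mapper above could equally be declared as Mapper<LongWritable, Text, Text, IntWritable>. With KeyValueTextInputFormat, each line is instead split at the first tab into a Text key and a Text value. The sketch below is a plain-Java simulation of those two behaviours, not Hadoop code: the class InputFormatKeyDemo and its methods are hypothetical names made up for illustration, and it assumes UTF-8 input with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free simulation of the (key, value) pairs that two
// different input formats would hand to map() for the same file contents.
public class InputFormatKeyDemo {

  // Mimics TextInputFormat: key = byte offset of the line start, value = line.
  public static List<Map.Entry<Long, String>> lineOffsetRecords(String contents) {
    List<Map.Entry<Long, String>> records = new ArrayList<>();
    if (contents.isEmpty()) {
      return records;
    }
    long offset = 0;
    for (String line : contents.split("\n")) {
      records.add(new SimpleEntry<>(offset, line));
      // Advance past the line's bytes plus the '\n' terminator.
      offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
    }
    return records;
  }

  // Mimics KeyValueTextInputFormat: split each line at the first tab.
  // A line without a tab becomes the key, with an empty value.
  public static List<Map.Entry<String, String>> tabSeparatedRecords(String contents) {
    List<Map.Entry<String, String>> records = new ArrayList<>();
    if (contents.isEmpty()) {
      return records;
    }
    for (String line : contents.split("\n")) {
      int tab = line.indexOf('\t');
      if (tab < 0) {
        records.add(new SimpleEntry<>(line, ""));
      } else {
        records.add(new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1)));
      }
    }
    return records;
  }

  public static void main(String[] args) {
    String contents = "Deer\nBeer\nBear\nBeer\n";
    // Same input, two different streams of (key, value) pairs.
    for (Map.Entry<Long, String> r : lineOffsetRecords(contents)) {
      System.out.println(r.getKey() + " -> " + r.getValue());
    }
    for (Map.Entry<String, String> r : tabSeparatedRecords(contents)) {
      System.out.println("[" + r.getKey() + "] -> [" + r.getValue() + "]");
    }
  }
}
```

For WordCount the offset key carries no useful information, which is why the map function ignores it. The key type you declare on the Mapper has to match what the job's InputFormat (through its RecordReader) produces.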
