Improve identity mapper in Wordcount

Question

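I created a map method that reads the map output of the wordcount example [1]. This example does not use the IdentityMapper.class provided by MapReduce, but it is the only way I found to make a working IdentityMapper for Wordcount. The only problem is that this Mapper takes much longer than I would like. I am starting to think I may be doing something redundant. Any help improving my WordCountIdentityMapper code?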

[1] Identity mapper

public class WordCountIdentityMapper extends MyMapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}

[2] Map class that generates the map output

public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

Thanks,

Answer 1

The solution was to replace StringTokenizer with the indexOf() method. It works much better, and I got better performance.
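
The answer does not include its code, so the following is only a minimal sketch of what an indexOf()-based parse might look like. It assumes each input line has the form `word<TAB>count` (a tab is the default key/value separator of Hadoop's TextOutputFormat; adjust if the job uses another one). The class name `LineSplit` and the methods `parseWord` and `parseCount` are hypothetical names introduced for illustration, and the sketch deliberately uses plain Java so it can be tried without a Hadoop classpath.

```java
// Hypothetical helper: splits a "word<TAB>count" line with indexOf()
// instead of creating a StringTokenizer for every record.
final class LineSplit {
    // Assumption: key and value are separated by a single tab.
    static final char SEPARATOR = '\t';

    // Returns the text before the first separator.
    static String parseWord(String line) {
        int sep = line.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in line: " + line);
        }
        return line.substring(0, sep);
    }

    // Returns the integer after the first separator.
    static int parseCount(String line) {
        int sep = line.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in line: " + line);
        }
        // trim() tolerates trailing whitespace such as a stray '\r'.
        return Integer.parseInt(line.substring(sep + 1).trim());
    }

    public static void main(String[] args) {
        String line = "hadoop\t3";
        System.out.println(parseWord(line) + " -> " + parseCount(line));
    }
}
```

In the mapper from the question, these two results would presumably feed `word.set(...)` and an `IntWritable` field that is reused through `set(...)` rather than a `new IntWritable(val)` per record. Whether the answerer did exactly this is not stated.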
