Categorizing words according to their length on reducer

I'm new to MapReduce applications. I just want to find the length of the words in my dataset and categorize them by length as tiny, little, med, or huge, and finally I want to see how many words in my dataset are tiny, little, med, or huge. But I'm having trouble implementing the reducer: when I execute the jar file on the Hadoop cluster, it doesn't return any results. I'd be grateful for any help. Here is the reducer code I tried to execute, but I guess it has a lot of mistakes.

public class WordSizeReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    IntWritable tin, smal, mediu,bi;
    int t, s, m, b;
    int count;
    Text tiny, small, medium, big;

    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{

        for (IntWritable val:values){   
            if(val.get() == 1){
                tin.set(t);
                t++;                            
                }
            else if(2<=val.get() && val.get()<=4){
                smal.set(s);
                s++;                
                }
            else if(5<=val.get() && val.get()<=9){
                mediu.set(m);
                m++;                
                }
            else if(10<=val.get()){
                bi.set(b);
                b++;    }

        }       
        context.write(tiny, tin);
        context.write(small, smal);
        context.write(medium, mediu);
        context.write(big, bi); 
    }
}

public class WordSizeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);     
        }
    }
}

tiny, small, medium and big are never initialized, so they will all be null.

That means all of your context.write() calls are using a null key.

Obviously that's not good, since you won't be able to distinguish the counts for the different word lengths.

Worse still, tin, smal, mediu and bi are never initialized either, so you'll get a NullPointerException when you try to call set() on them (you do initialize result correctly, but then never use it).

(Also, you don't need to keep setting the IntWritable values inside the loop; just update t, s, m and b there, and set each IntWritable once at the end, right before the context.write() calls.)
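In other words, something along these lines at the end of reduce(), reusing your field names (just a sketch of the relevant part, assuming the loop has already updated the plain int counters):

        // after the loop: copy the int counters into the writables once
        tin.set(t);
        smal.set(s);
        mediu.set(m);
        bi.set(b);
        context.write(tiny, tin);     // key and value must both be non-null here
        context.write(small, smal);
        context.write(medium, mediu);
        context.write(big, bi);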

Update for the mapper code:

For every word in the input, you are writing the key-value pair (length, 1).

The reducer will collect all of the values that have the same key, so it will be called with, for example:

(2, [1,1,1,1,1,1,1,1])
(3, [1,1,1])

So your reducer only ever sees the value 1, which it is wrongly treating as a word length. The word length is actually in the key.
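Putting both points together, the reducer could look something like this. This is only a sketch, not tested on your data; it sums the 1s for each length key, buckets by the key, and emits the four totals from cleanup(), which runs once at the end of the reduce task (so it assumes the default single reducer).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordSizeReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable> {
    private int t, s, m, b;   // running totals per category

    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The key is the word length; the values are the 1s emitted by the mapper.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        int length = key.get();
        if (length == 1) {
            t += count;
        } else if (length <= 4) {
            s += count;
        } else if (length <= 9) {
            m += count;
        } else {
            b += count;
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once when the reduce task finishes: write each category exactly once.
        context.write(new Text("tiny"), new IntWritable(t));
        context.write(new Text("small"), new IntWritable(s));
        context.write(new Text("medium"), new IntWritable(m));
        context.write(new Text("big"), new IntWritable(b));
    }
}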

Update for the stack trace:

The error message explains what is wrong - Hadoop can't find your job classes, so they aren't being executed at all. The error says:

java.lang.ClassNotFoundException: WordSize.WordsizeMapper

but your class is called WordSizeMapper (or possibly WordSize.WordSizeMapper if you have an outer class) - note the different capitalization of "size"/"Size"! You need to check how you are invoking Hadoop.
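For example, if your classes live in a package called WordSize and you've packed them into a jar (the jar name below is just a placeholder), the job would normally be launched like this, with the driver's fully-qualified class name spelled exactly as it appears in the source:

hadoop jar wordsize.jar WordSize.WordSizeTest <input path> <output path>

If the name on the command line (or the one recorded in the job configuration) doesn't match the compiled class exactly, including case, Hadoop fails with exactly this kind of ClassNotFoundException.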

Unfortunately, I checked my code again and made some fixes, but the result is the same: in the Hadoop terminal window I still can't get any results. The latest version of the code is below:

public class WordSizeTest {
    public static void main(String[] args) throws Exception{
        if(args.length != 2)
        {
            System.err.println("Usage: Word Size <in> <out>");
            System.exit(2);
        } 
        Job job = new Job();    
        job.setJarByClass(WordSizeTest.class); 
        job.setJobName("Word Size");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordSizeMapper.class); 
        job.setReducerClass(WordSizeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class); 
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
public class WordSizeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    final static IntWritable one = new IntWritable(1);
    IntWritable wordLength = new IntWritable();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);     
    }
    }
}
public class WordSizeReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable>{
    IntWritable tin = new IntWritable();
    IntWritable smal = new IntWritable();
    IntWritable mediu = new IntWritable();
    IntWritable bi = new IntWritable();
    int t, s, m, b;
    Text tiny = new Text("tiny");
    Text small = new Text("small");
    Text medium = new Text("medium");
    Text big = new Text("big");
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{        
        for (IntWritable val:values){
            if(key.get() == 1){
                t += val.get();                         
                }
            else if(2<=key.get() && key.get()<=4){
                s += val.get();             
                }
            else if(5<=key.get() && key.get()<=9){
                m += val.get();             
                }
            else if(10<=key.get()){
                b += val.get();             
                }

        }
        tin.set(t); 
        smal.set(s);
        mediu.set(m);
        bi.set(b);
        context.write(tiny, tin);
        context.write(small, smal);
        context.write(medium, mediu);
        context.write(big, bi); 
    }
    }

This is the error shown in the terminal:

15/02/01 12:09:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/01 12:09:25 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/01 12:09:25 INFO input.FileInputFormat: Total input paths to process : 925
15/02/01 12:09:25 WARN snappy.LoadSnappy: Snappy native library is available
15/02/01 12:09:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/02/01 12:09:25 INFO snappy.LoadSnappy: Snappy native library loaded
15/02/01 12:09:29 INFO mapred.JobClient: Running job: job_201501191143_0177
15/02/01 12:09:30 INFO mapred.JobClient:  map 0% reduce 0%
15/02/01 12:09:47 INFO mapred.JobClient: Task Id : attempt_201501191143_0177_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: WordSize.WordSizeMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:859)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child.run(Child.java:255)
    at java.security.AccessController.doPrivileged(AccessController.java:310)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: WordSize.WordsizeMapper
    at java.lang.Class.forName(Class.java:174)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:812)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
    ... 8 more
15/02/01 12:09:49 INFO mapred.JobClient: Task Id : attempt_201501191143_0177_m_000000_0, Status : FAILED