ArrayIndexOutOfBoundsException when reading empty cells of a CSV in MapReduce

Question

I am trying to run a MapReduce program to process the following data.

Here is my mapper code:

@Override
protected void map(Object key, Text value, Mapper.Context context) throws IOException, ArrayIndexOutOfBoundsException,InterruptedException {
    String tokens[]=value.toString().split(",");
    if(tokens[6]!=null){
        context.write(new Text(tokens[6]), new IntWritable(1));
    }

}

Some cells in my data are empty, so when I try to read the Carrier_delay column I get the following error. Please advise.

17/04/13 20:45:29 INFO mapreduce.Job: Task Id : attempt_1491849620104_0017_m_000000_0, Status : FAILED
Error: java.lang.ArrayIndexOutOfBoundsException: 6
    at Test.TestMapper.map(TestMapper.java:22)
    at Test.TestMapper.map(TestMapper.java:17)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)

Configuration conf = new Configuration();
Job job = Job.getInstance(conf,"IP Access");
job.setJarByClass(Test.class);
job.setMapperClass(TestMapper.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

Answer 1

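Are all of your columns the ones shown in the picture? If so, remember that Java arrays are 0-indexed, so your columns range from 0 to 5 and tokens[6] is out of bounds. Alternatively, depending on the logic you need, you can add a validation to the if: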

if (tokens.length > n && tokens[n] != null) {
    context.write(new Text(tokens[n]), new IntWritable(1));
}

Answer 2

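Carrier delay is the second field, so you need to access it with tokens[1], because array indices start at 0. You can also do a length check before accessing a specific index. tokens[6] throws the error because you have six columns in total. If you want the last field, it is tokens[5], that is, length minus 1.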
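
The length check mentioned above could look roughly like the following plain-Java sketch (no Hadoop required). The class name ColumnGuard, the helper columnOrNull and the sample rows are hypothetical, and treating a blank cell as missing is an assumption about what you want:

```java
class ColumnGuard {
    // Returns the trimmed cell at the given 0-based index,
    // or null when the row is too short or the cell is blank.
    static String columnOrNull(String line, int index) {
        // The -1 limit keeps trailing empty cells in the array.
        String[] tokens = line.split(",", -1);
        if (index < 0 || index >= tokens.length) {
            return null; // the row does not have that many columns
        }
        String cell = tokens[index].trim();
        return cell.isEmpty() ? null : cell;
    }

    public static void main(String[] args) {
        // Hypothetical six-column rows; index 1 stands in for the carrier delay.
        System.out.println(columnOrNull("AA,15,0,0,0,0", 1));
        System.out.println(columnOrNull("AA,,0,0,0,0", 1));
        System.out.println(columnOrNull("AA,15,0,0,0,0", 6));
    }
}
```

In the mapper you would then call context.write only when the helper returns a non-null value.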

Answer 3

The problem is in this line: if(tokens[6]!=null){.

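You want to read the value of tokens[6] and then check whether it is null. However, some rows contain only six columns (the seventh is empty), so in those cases tokens is a six-element array holding the values tokens[0] through tokens[5]. When you try to access tokens[6], you go past the end of the array, which is why you get an ArrayIndexOutOfBoundsException.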
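
The reason the array can end up one element short is that Java's String.split(",") discards trailing empty strings, whereas split(",", -1) keeps them. A small plain-Java sketch of that behavior; the class name SplitDemo and the sample row are made up for illustration and are not taken from your data:

```java
import java.util.Arrays;

class SplitDemo {
    // Splits a CSV line, optionally keeping trailing empty cells.
    static String[] tokenize(String line, boolean keepTrailingEmpty) {
        // A limit of -1 tells split to keep trailing empty strings;
        // the one-argument form drops them.
        return keepTrailingEmpty ? line.split(",", -1) : line.split(",");
    }

    public static void main(String[] args) {
        // Hypothetical row whose seventh cell (index 6) is empty.
        String row = "2008,1,3,WN,335,IAD,";

        String[] dropped = tokenize(row, false); // 6 tokens, index 6 does not exist
        String[] kept = tokenize(row, true);     // 7 tokens, the last one is ""

        System.out.println(dropped.length + " " + Arrays.toString(dropped));
        System.out.println(kept.length + " " + Arrays.toString(kept));
    }
}
```

With the -1 limit the empty cell shows up as an empty string rather than as a missing element, so you would test it with isEmpty() instead of comparing it to null.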

The correct way to do what you want is:

IntWritable one = new IntWritable(1); //this saves some time ;)
Text keyOutput = new Text(); //the same goes here

@Override
protected void map(Object key, Text value, Mapper.Context context) throws IOException, ArrayIndexOutOfBoundsException,InterruptedException {
    String tokens[]=value.toString().split(",");
    if(tokens.length == 7){
        keyOutput.set(tokens[6]);
        context.write(keyOutput, one);
    }

}

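More tips: judging from the part of your code you posted, I guess you want to count how many times each carrier delay value occurs. In that case, you can also use a combiner to speed up the job, as the WordCount program does. You can also parse the carrier delay into an IntWritable to save time and space.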
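
As a rough illustration of the parsing tip, here is a plain-Java sketch that turns a cell into an int and skips empty or non-numeric cells. It assumes the carrier delay values are whole numbers, which the question does not confirm; DelayParser and parseDelay are hypothetical names:

```java
import java.util.OptionalInt;

class DelayParser {
    // Parses a CSV cell as an int; returns an empty OptionalInt
    // for null, blank or non-numeric cells instead of throwing.
    static OptionalInt parseDelay(String cell) {
        if (cell == null) {
            return OptionalInt.empty();
        }
        String trimmed = cell.trim();
        if (trimmed.isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(trimmed));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // e.g. a header cell or "NA"
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDelay("15").isPresent());
        System.out.println(parseDelay("").isPresent());
        System.out.println(parseDelay("NA").isPresent());
    }
}
```

In the mapper, a present value could be stored in a reused IntWritable via its set(int) method and written as the key; the map output key class in the driver would then need to be IntWritable instead of Text.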

hadoop

mapreduce

hdfs