Getting unknown integer value in Map-reduce output file

I am working on a Hadoop map-reduce program in which I have not set a mapper or a reducer, nor any other parameters in the job configuration. I did so assuming that the job would send the same output as the input to the output file. But I found that it prints some dummy integer value in the output file, with every line separated by (I guess) a tab.

Here is my code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        String argg[] = {"/Users/***/Documents/hadoop/input/input.txt",
                            "/Users/***/Documents/hadoop/output_MinimalMapReduce"}; 
        try{
            int exitCode = ToolRunner.run(new MinimalMapReduce(), argg);
            System.exit(exitCode);
        }catch(Exception e){
            e.printStackTrace();
        }
    }
}

Here is the input:

2011 22
2011 25
2012 40
2013 35
2013 38
2014 44
2015 43

And here is the output:

0   2011 22
8   2011 25
16  2012 40
24  2013 35
32  2013 38
40  2014 44
48  2015 43

How can I get the same output as the input?

I did so assuming that the Job will send the same output as the input to the output file

Your assumption is correct. Technically, you do get whatever is in the file as output. But keep in mind that mappers and reducers take key-value pairs as input.

The input to a mapper is an input split of the file, and the input to a reducer is the output of the mappers.

But what I found is that it is printing some dummy integer value in the output file, with every line separated by a tab

Those dummy integers are nothing but the offset of each line from the beginning of the file. Since every one of your lines contains [4 DIGITS]<space>[2 DIGITS]<new-line>, each line occupies 4 + 1 + 2 + 1 = 8 bytes, and hence your offsets are multiples of eight.

You might ask why you get this offset at all when you have not defined any mapper or reducer. The reason is that a mapper always runs: when you do not set one, the job falls back to a default mapper that simply passes each (offset, line) pair through unchanged, which is known as the IdentityMapper.
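
To see why, here is (in slightly simplified form) what the default map method of org.apache.hadoop.mapreduce.Mapper does. It is a pure pass-through, so the offset key produced by TextInputFormat survives into the output:

// Default map() of org.apache.hadoop.mapreduce.Mapper, slightly simplified:
// the input pair is written straight to the output, offset key included.
protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
}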

How can I get the same output as the input?

Well, you can define a mapper that maps each input line to the output and strips off the offset, along these lines (the class name LineOnlyMapper is just an illustrative choice):

public static class LineOnlyMapper extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit only the line itself and drop the offset key
        context.write(value, NullWritable.get());
    }
}

In the code above, key holds the dummy integer value, i.e. the offset, while value holds each line, one at a time. Writing only value with context.write, and running no reducer at all (job.setNumReduceTasks(0)), gives the desired output.
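
Wired into the driver from the question, a map-only job would then look roughly like this (a sketch, using the hypothetical LineOnlyMapper above; with a NullWritable value, TextOutputFormat writes just the line and no trailing tab):

Job job = Job.getInstance(getConf());
job.setJarByClass(getClass());
job.setMapperClass(LineOnlyMapper.class); // the illustrative mapper sketched above
job.setNumReduceTasks(0);                 // map-only: mapper output goes straight to the output files
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;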

I agree with @philantrovert's answer, but here are some more details I found. According to Hadoop: The Definitive Guide, the keys produced by TextInputFormat are byte offsets, not line numbers. Here is what the book says about TextInputFormat:

TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g., newline or carriage return), and is packaged as a Text object. So, a file containing the following text:

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

is divided into one split of four records. The records are interpreted as the following key-value pairs:

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.

However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file’s name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.
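
For the fixed-width input in the question, every record is exactly 8 bytes, so a line number could be recovered from the offset key as in this sketch (LINE_WIDTH = 8 reflects the [4 DIGITS]<space>[2 DIGITS]<new-line> layout; it is not something Hadoop provides):

// Sketch: recover 0-based line numbers from offset keys when every
// record has the same fixed width (8 bytes for the input above).
static final long LINE_WIDTH = 8;

static long lineNumber(long byteOffset) {
    return byteOffset / LINE_WIDTH; // e.g. offset 16 -> line 2, the record "2012 40"
}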