What is the maximum input file size (without split) for a Mapper in Hadoop MapReduce?

Question

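I wrote a MapReduce job that takes some Protobuf files as input. Because the files are not splittable, each file is processed by a single mapper (I implemented a custom FileInputFormat with isSplitable returning false). The application works fine and produces result files for input files smaller than ~680 MB. However, once an input file exceeds that size, the application still completes successfully but produces an empty file.

I am wondering whether I am hitting a file size limit for a Mapper. If it matters, the files are stored on Google Storage (GFS) rather than HDFS.

Thanks!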

Answer 1

It turned out that I had hit a well-known Hadoop bug. The problem is the BytesWritable class that is used to write the Protobuf files. In my custom RecordReader I previously had:

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
        // The whole (unsplittable) file is read into a single byte array.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        log.debug("Path file:" + file);
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            // value is a BytesWritable; this call is where large inputs fail.
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            // The exception is only logged, so on failure value stays empty
            // while the method still returns true.
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }
        processed = true;
        return true;
    }
    return false;
}

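Because of this bug, the maximum content size is effectively Integer.MAX_VALUE / 3 by default, which is about 680 MB. To get around it, I had to set the capacity manually by calling

value.setCapacity(my_ideal_max_size)

before calling value.set().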
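The ~680 MB figure can be reproduced with plain integer arithmetic. The following is a minimal sketch, assuming that the affected BytesWritable versions grow their buffer to size * 3 / 2 using int arithmetic; the class and method names (GrowthOverflowDemo, grownCapacityInt, grownCapacitySafe) are hypothetical and not part of Hadoop. It shows that the multiplication wraps to a negative number as soon as the size exceeds Integer.MAX_VALUE / 3, and that doing the same computation in long arithmetic avoids the wrap.

```java
class GrowthOverflowDemo {
    // Assumed growth rule, computed in 32-bit int arithmetic:
    // size * 3 wraps around once size exceeds Integer.MAX_VALUE / 3.
    static int grownCapacityInt(int size) {
        return size * 3 / 2;
    }

    // Same rule computed in 64-bit arithmetic and clamped to the int range.
    static int grownCapacitySafe(int size) {
        return (int) Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    }

    public static void main(String[] args) {
        int threshold = Integer.MAX_VALUE / 3;   // 715827882 bytes
        int tooBig = threshold + 1;

        System.out.println("threshold in bytes: " + threshold);
        System.out.println("threshold in MiB:   " + threshold / (1024 * 1024));
        System.out.println("int growth at threshold:  " + grownCapacityInt(threshold));
        System.out.println("int growth just above it: " + grownCapacityInt(tooBig));
        System.out.println("long growth just above it: " + grownCapacitySafe(tooBig));
    }
}
```

Under that assumption, setting the capacity up front means the buffer never has to grow inside value.set(), so the overflowing computation is not reached.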

Hope this helps someone else!

hadoop

mapreduce

hdfs

google-cloud-storage