Using MapReduce to read the files within a directory

Question

My S3 directory looks like this:

/sssssss/xxxxxx/rrrrrr/xx/file1
/sssssss/xxxxxx/rrrrrr/xx/file2
/sssssss/xxxxxx/rrrrrr/xx/file3
/sssssss/xxxxxx/rrrrrr/yy/file4
/sssssss/xxxxxx/rrrrrr/yy/file5
/sssssss/xxxxxx/rrrrrr/yy/file6

How can my MapReduce program read these files from S3?

Answer 1

For a single input path, do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));

For two input paths, do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));
FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/yy/"));

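Alternatively, use addInputPaths(), which takes all the paths as a single comma-separated string. For details, see the documentation of FileInputFormat (the exact API depends on your Hadoop version).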
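As a minimal sketch of the addInputPaths() variant: the helper below only builds the comma-separated string that the method accepts. The class name InputPathList and the method commaSeparated are hypothetical names made up for this illustration, and the actual Hadoop call is left in a comment because it needs the Hadoop client libraries on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class InputPathList {
    /**
     * Hypothetical helper: joins a base directory with each subdirectory
     * and returns the result as one comma-separated string.
     */
    public static String commaSeparated(String base, String... subdirs) {
        // Make sure exactly one slash separates the base from each subdirectory.
        String prefix = base.endsWith("/") ? base : base + "/";
        List<String> paths = new ArrayList<>();
        for (String dir : subdirs) {
            paths.add(prefix + dir + "/");
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        String inputs = commaSeparated("/sssssss/xxxxxx/rrrrrr", "xx", "yy");
        System.out.println(inputs);
        // In a real driver, with Hadoop on the classpath and a configured Job:
        // FileInputFormat.addInputPaths(job, inputs);
    }
}
```

This is equivalent to calling addInputPath() once per directory; it is mostly convenient when the list of directories arrives as a command-line argument.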

Answer 2

This can be simplified as follows:

FileInputFormat.setInputDirRecursive(job, true);
FileInputFormat.addInputPaths(job, args[0]);

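You only need to supply the base path of the S3 directory rather than the exact location of each file. With recursion enabled, the input format descends through the subdirectories until it reaches the ones that contain the files.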
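To illustrate what "recursive" means for the layout in the question, here is a local-filesystem analogy using java.nio. It is not Hadoop's implementation: Hadoop walks its own FileSystem abstraction and may apply additional filtering. The class RecursiveListing and its methods are hypothetical names for this sketch, and the directory tree is a throwaway temp directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveListing {
    /** Returns every regular file under base, as sorted paths relative to base. */
    public static List<String> listFiles(Path base) {
        try (Stream<Path> walk = Files.walk(base)) {
            return walk.filter(Files::isRegularFile)
                       .map(p -> base.relativize(p).toString().replace('\\', '/'))
                       .sorted()
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small temp tree shaped like the question's xx/ and yy/ layout. */
    public static Path demoTree() {
        try {
            Path base = Files.createTempDirectory("rrrrrr");
            Files.createDirectories(base.resolve("xx"));
            Files.createDirectories(base.resolve("yy"));
            Files.createFile(base.resolve("xx").resolve("file1"));
            Files.createFile(base.resolve("yy").resolve("file4"));
            return base;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Given only the base directory, files in both subdirectories are found.
        System.out.println(listFiles(demoTree()));
    }
}
```

The point is the same as in the answer: a single base path is enough, and files in both xx/ and yy/ are picked up without naming either subdirectory.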

mapreduce

amazon-s3

amazon-web-services

amazon-emr

emr