Two ways to read an HDFS sequence file, but one fails with EOFException

Reading the sequence file with SparkContext, as follows:

Method 1:

import org.apache.hadoop.io.BytesWritable

val rdd = sc.sequenceFile(path, classOf[BytesWritable],
          classOf[BytesWritable])
rdd.count()

Method 2:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

val rdd = sc.hadoopFile(path,
          classOf[SequenceFileAsBinaryInputFormat],
          classOf[BytesWritable],
          classOf[BytesWritable])
rdd.count()

Method 1 ends with an EOFException, but Method 2 works. What is the difference between these two approaches?

The difference is that "Method 1" immediately calls hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions) using SequenceFileInputFormat[BytesWritable, BytesWritable], whereas "Method 2" makes the same call, except of course that it uses SequenceFileAsBinaryInputFormat.
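
For comparison, "Method 1" is roughly equivalent to spelling the call out yourself (a sketch, reusing the same sc and path as above; it mirrors what is described here rather than quoting the Spark source):

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileInputFormat

// sc.sequenceFile ends up calling hadoopFile with the plain
// SequenceFileInputFormat, which reads via SequenceFileRecordReader.
val rdd1 = sc.hadoopFile(path,
           classOf[SequenceFileInputFormat[BytesWritable, BytesWritable]],
           classOf[BytesWritable],
           classOf[BytesWritable])
rdd1.count()
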
Going further: even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], it has its own inner class called SequenceFileAsBinaryRecordReader. Although that reader works much like SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences: when you look at the code, the two implementations diverge, and the former handles compression better. So if your sequence file is record-compressed or block-compressed, it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] does not iterate over it with the same reliability as SequenceFileAsBinaryInputFormat.
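
If you want to verify whether compression is the culprit, you can inspect the file's header directly (a sketch, assuming a reachable Hadoop Configuration; the path below is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Open the sequence file and report how it is compressed.
val conf = new Configuration()
val reader = new SequenceFile.Reader(conf,
             SequenceFile.Reader.file(new Path("/path/to/file.seq")))
try {
  println(s"compressed: ${reader.isCompressed}, block-compressed: ${reader.isBlockCompressed}")
} finally {
  reader.close()
}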

SequenceFileAsBinaryInputFormat uses SequenceFileAsBinaryRecordReader (lines 102-115) - https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

SequenceFileRecordReader (lines 79-91) - https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java