Two ways to read an HDFS sequence file, but one fails with EOFException
Reading a sequence file with SparkContext, as follows:

Method 1:

import org.apache.hadoop.io.BytesWritable

val rdd = sc.sequenceFile(path, classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()

Method 2:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

val rdd = sc.hadoopFile(path,
  classOf[SequenceFileAsBinaryInputFormat],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 1 fails with an EOFException, but Method 2 works. What is the difference between these two methods?
The difference is that "Method 1" immediately calls hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions) with SequenceFileInputFormat[BytesWritable, BytesWritable] as the input format, whereas "Method 2" makes the same call, except of course with SequenceFileAsBinaryInputFormat.
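You can see this in the Spark source: SparkContext.sequenceFile is essentially a thin wrapper over hadoopFile that hard-wires the input format. A paraphrased sketch (details may vary by Spark version):

def sequenceFile[K, V](
    path: String,
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int): RDD[(K, V)] = withScope {
  // The old mapred SequenceFileInputFormat is fixed here, which is
  // why Method 1 cannot substitute SequenceFileAsBinaryInputFormat.
  val inputFormatClass = classOf[SequenceFileInputFormat[K, V]]
  hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)
}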
Going further: even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], SequenceFileAsBinaryInputFormat has its own inner class called SequenceFileAsBinaryRecordReader. Although it works much like SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences: when you look at the code, the two implementations diverge, and in particular the former handles compression better. So if your sequence file is record-compressed or block-compressed, it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] does not iterate over it as reliably as SequenceFileAsBinaryInputFormat does.
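To confirm that compression is the culprit, you can read the file's header directly with Hadoop's SequenceFile.Reader. A minimal sketch, assuming a Hadoop 2+ client on the classpath and that path points at a single sequence file:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

val reader = new SequenceFile.Reader(new Configuration(),
  SequenceFile.Reader.file(new Path(path)))
try {
  // isCompressed covers record and block compression;
  // isBlockCompressed narrows it to block compression.
  val kind =
    if (reader.isBlockCompressed) "block-compressed"
    else if (reader.isCompressed) "record-compressed"
    else "uncompressed"
  val codec = Option(reader.getCompressionCodec).map(_.getClass.getSimpleName)
  println(s"$kind, codec = $codec")
} finally {
  reader.close()
}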
SequenceFileAsBinaryInputFormat uses SequenceFileAsBinaryRecordReader (lines 102-115):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

SequenceFileRecordReader (lines 79-91):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java
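One practical note if you go with Method 2: Hadoop RecordReaders reuse the same Writable instances across records, so if you do anything beyond count() (collect, cache, shuffle), copy the bytes out first. A hedged sketch of that pattern:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

val bytesRdd = sc.hadoopFile(path,
    classOf[SequenceFileAsBinaryInputFormat],
    classOf[BytesWritable],
    classOf[BytesWritable])
  .map { case (k, v) =>
    // getBytes returns the reused backing array, which may also be
    // longer than the record, so copy exactly getLength bytes.
    (java.util.Arrays.copyOfRange(k.getBytes, 0, k.getLength),
     java.util.Arrays.copyOfRange(v.getBytes, 0, v.getLength))
  }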