Spark Parquet read error : java.io.EOFException: Reached the end of stream with XXXXX bytes left to read

Question

When reading a Parquet file in Spark, I run into the following error.

App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44, 10.23.5.196, executor 2): java.io.EOFException: Reached the end of stream with 193212 bytes left to read
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
App > at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
App > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
App > at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)

It happens with the following Spark commands:

val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)

Answer 1

I think you can work around this issue by disabling the vectorized Parquet reader:

--conf spark.sql.parquet.enableVectorizedReader=false
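
If relaunching with a new `--conf` flag is inconvenient, the same option can probably be set on a running session, since `spark.sql.parquet.enableVectorizedReader` is a SQL configuration. The following is only a sketch: it assumes a `spark-shell`-style environment where `getOrCreate()` returns the existing session, and it keeps the elided path from the question as-is.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a session already exists (as in spark-shell); getOrCreate() reuses it.
val spark = SparkSession.builder().getOrCreate()

// Switch off the vectorized Parquet reader for subsequent reads in this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Same read as in the question; the path is elided there and is left elided here.
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
```

Note that this changes which reader is used rather than repairing anything in the file, so it may hide the error without explaining it, and the non-vectorized path is generally slower.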

Answer 2

That did not work for me, but the following did:

--conf spark.hadoop.fs.s3a.experimental.input.fadvise=sequential

I'm not sure why it works, but what gave me the hint was this issue and some details about the options here.
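
For an application that builds its own session, the same S3A setting could be supplied in code instead of on the command line. This is a hedged sketch, not a verified fix: the application name is hypothetical, and it assumes no session or S3A filesystem instance has been created yet, because Hadoop filesystem options are typically read when the filesystem is first initialized.

```scala
import org.apache.spark.sql.SparkSession

// The "spark.hadoop." prefix forwards the property to the Hadoop configuration,
// where the S3A connector reads it as fs.s3a.experimental.input.fadvise.
val spark = SparkSession.builder()
  .appName("parquet-s3a-read") // hypothetical name, for illustration only
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
  .getOrCreate()

// Same read as in the question; the path is elided there and is left elided here.
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
```

In `spark-shell`, where the session already exists, setting it this way may have no effect, so passing `--conf` at launch as shown above is the safer route.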

apache-spark

parquet

apache-spark-sql