docker 镶木地板错误中的 Spark 未找到预定义架构
Spark in docker parquet error No predefined schema found
我有一个https://github.com/gettyimages/docker-spark based local spark test cluster including R. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
尝试使用 sparkR 读取镶木地板文件时发生此异常。读取 parquet 文件在本地 spark 安装上没有任何问题。
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Fehler in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
奇怪的是,相同的错误是相同的 - 即使是不存在的文件。
但是在终端中我可以看到文件在那里:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
在我测试 dockerized spark 的过程中,我的初始镶木地板文件似乎已损坏。
解决方法:从原始来源重新创建 parquet 文件
我有一个https://github.com/gettyimages/docker-spark based local spark test cluster including R. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
尝试使用 sparkR 读取镶木地板文件时发生此异常。读取 parquet 文件在本地 spark 安装上没有任何问题。
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Fehler in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
奇怪的是,相同的错误是相同的 - 即使是不存在的文件。
但是在终端中我可以看到文件在那里:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
在我测试 dockerized spark 的过程中,我的初始镶木地板文件似乎已损坏。
解决方法:从原始来源重新创建 parquet 文件