How to read and write a custom class from a parquet file
I am trying to write a parquet read/write class for a particular class type, using DataFrame/Datasets.
Class schema:
class A {
long count;
List<B> listOfValues;
}
class B {
String id;
long count;
}
Code:
String path = "some path";
List<A> entries = somerandomAentries();
JavaRDD<A> rdd = sc.parallelize(entries, 1);
DataFrame df = sqlContext.createDataFrame(rdd, A.class);
df.write().parquet(path);
DataFrame newDataDF = sqlContext.read().parquet(path);
newDataDF.show();
When I try to run this, it throws an error. What am I missing here? Do I need to provide a schema for the whole class when creating the DataFrame?
Error:
Caused by: scala.MatchError: B(Id=abc, count=0) (of class B)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:169)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$$anonfun$apply.apply(SQLContext.scala:1358)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$$anonfun$apply.apply(SQLContext.scala:1358)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows.apply(SQLContext.scala:1358)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows.apply(SQLContext.scala:1356)
at scala.collection.Iterator$$anon.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
... 8 more
You are getting the error because nested JavaBeans are not supported in Spark 1.6. See https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#inferring-the-schema-using-reflection :
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.
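As a possible workaround on Spark 1.6, you can bypass bean reflection entirely by declaring an explicit schema and mapping each bean to a Row yourself. The sketch below assumes A and B expose standard JavaBean getters (getCount(), getListOfValues(), getId()); adjust the accessors to match your classes:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema mirroring B and A, so Spark never has to infer it from the beans.
StructType bSchema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("id", DataTypes.StringType, true),
        DataTypes.createStructField("count", DataTypes.LongType, false)
});
StructType aSchema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("count", DataTypes.LongType, false),
        DataTypes.createStructField("listOfValues", DataTypes.createArrayType(bSchema), true)
});

// Convert each A into a Row; the nested B objects become nested Rows.
JavaRDD<Row> rows = rdd.map(a -> {
    List<Row> bs = new ArrayList<>();
    for (B b : a.getListOfValues()) {                        // assumed getter
        bs.add(RowFactory.create(b.getId(), b.getCount()));  // assumed getters
    }
    return RowFactory.create(a.getCount(), bs);              // assumed getter
});

DataFrame df = sqlContext.createDataFrame(rows, aSchema);
df.write().parquet(path);

// Reading it back yields a DataFrame with the same nested schema.
DataFrame newDataDF = sqlContext.read().parquet(path);
newDataDF.show();

Alternatively, if upgrading is an option, the Dataset API in Spark 2.x (spark.createDataset(entries, Encoders.bean(A.class))) supports nested JavaBeans, so no hand-written schema is needed.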