How to create a Dataset from a custom class Person?
I am trying to create a Dataset in Java, so I wrote the following code:
public Dataset<Person> createDataset() {
    List<Person> list = new ArrayList<>();
    list.add(new Person("name", 10, 10.0));
    Dataset<Person> dataset = sqlContext.createDataset(list, Encoders.bean(Person.class));
    return dataset;
}
The Person class is an inner class. However, Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `....` without access to the scope that this class was defined in. Try moving this class out of its parent class.;
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun.applyOrElse(ExpressionEncoder.scala:264)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun.applyOrElse(ExpressionEncoder.scala:260)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun.apply(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun.apply(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
How do I do this correctly?
The solution is to add this line at the beginning of the method:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);
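Put together, the method from the question would look roughly like this. This is a sketch, not a tested implementation: `sqlContext` and the `Person` constructor are assumed from the question, and `OuterScopes.addOuterScope(this)` registers the enclosing instance so Spark can resolve the inner class's outer scope when generating the encoder:

```java
public Dataset<Person> createDataset() {
    // Register this enclosing instance before building the encoder,
    // so the inner Person class's outer scope can be resolved.
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);

    List<Person> list = new ArrayList<>();
    list.add(new Person("name", 10, 10.0));
    return sqlContext.createDataset(list, Encoders.bean(Person.class));
}
```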
tl;dr (Spark shell only) Define your case classes first and, once they are defined, use them. Using case classes in a Spark/Scala application should work fine.
In the Spark 2.0.1 shell you should define your case classes first and only then access them to create a Dataset.
$ ./bin/spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-SNAPSHOT
/_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
Branch master
Compiled by user jacek on 2016-10-25T04:20:04Z
Revision 483c37c581fedc64b218e294ecde1a7bb4b2af9c
Url https://github.com/apache/spark.git
Type --help for more information.
$ ./bin/spark-shell
scala> :pa
// Entering paste mode (ctrl-D to finish)
case class Person(id: Long)
Seq(Person(0)).toDS // <-- this won't work
// Exiting paste mode, now interpreting.
<console>:15: error: value toDS is not a member of Seq[Person]
Seq(Person(0)).toDS // <-- it won't work
^
scala> case class Person(id: Long)
defined class Person
scala> // the following implicit conversion *will* work
scala> Seq(Person(0)).toDS
res1: org.apache.spark.sql.Dataset[Person] = [id: bigint]
A fix for this issue was added in commit 43ebf7a9cbd70d6af75e140a6fc91bf0ffc2b877 (Spark 2.0.0-SNAPSHOT as of March 21).
In the Scala REPL, I had to add OuterScopes.addOuterScope(this) while in :paste mode. The complete snippet follows:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._
case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
Token("aaa", 200, 0.29) ::
Token("bbb", 200, 0.53) ::
Token("bbb", 300, 0.42) :: Nil
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
val ds = data.toDS
For a similar problem in Scala, my solution was to do exactly what the AnalysisException suggested: move the case class out of its parent class. For example, I had something like the following in Streaming_Base.scala:
abstract class Streaming_Base {
case class EventBean(id:String, command:String, recordType:String)
...
}
I changed it to the following:
case class EventBean(id:String, command:String, recordType:String)
abstract class Streaming_Base {
...
}
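The same "move the class out of its parent" advice applies to the original Java question: Person fails because it is a non-static inner class. Besides addOuterScope, a cleaner fix is to make Person a top-level (or static nested) class shaped as a JavaBean, since Encoders.bean needs a public no-arg constructor and getter/setter pairs. A minimal sketch follows; the field names (name, age, score) are assumptions inferred from new Person("name", 10, 10.0), which is all the question shows:

```java
// Sketch of a Person bean that Encoders.bean(Person.class) can handle:
// a top-level class (not an inner class) with a public no-arg constructor
// and a getter/setter pair for every field.
// Field names are assumptions; the question only shows the constructor call.
public class Person {
    private String name;
    private int age;
    private double score;

    public Person() {}  // no-arg constructor required by Encoders.bean

    public Person(String name, int age, double score) {
        this.name = name;
        this.age = age;
        this.score = score;
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }

    public double getScore() { return score; }
    public void setScore(double score) { this.score = score; }

    public static void main(String[] args) {
        Person p = new Person("name", 10, 10.0);
        System.out.println(p.getName() + " " + p.getAge() + " " + p.getScore());
    }
}
```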