Scala Spark Dataset change class type
I have a DataFrame that I created following the schema of MyData1, and then I added a column so that the new DataFrame follows the schema of MyData2. Now I want to return the new DataFrame as a Dataset, but I get the following error:
[info] org.apache.spark.sql.AnalysisException: cannot resolve '`hashed`' given input columns: [id, description];
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$$anonfun$apply.applyOrElse(CheckAnalysis.scala:110)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$$anonfun$apply.applyOrElse(CheckAnalysis.scala:107)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp.apply(TreeNode.scala:278)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp.apply(TreeNode.scala:278)
Here is my code:
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lit

case class MyData1(id: String, description: String)
case class MyData2(id: String, description: String, hashed: String)

object MyObject {
  def read(arg1: String, arg2: String): Dataset[MyData2] = {
    // Matcher and spark.implicits._ are defined/imported elsewhere in my project
    var df: DataFrame = null
    val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
    val obj2 = new Matcher("cbutrer383", "g567g4rwew")
    val obj3 = new Matcher("cbutrer383", "567yr45e45")
    df = Seq(obj1, obj2, obj3).toDF("id", "description")
    df.withColumn("hashed", lit("hash"))
    val ds: Dataset[MyData2] = df.as[MyData2]
    ds
  }
}
I know the following line is probably the problem, but I can't figure out why:

val ds: Dataset[MyData2] = df.as[MyData2]

I'm new to this, so I may be making a basic mistake. Can anyone help? TIA
You forgot to assign the newly created DataFrame back to df:
df = df.withColumn("hashed", lit("hash"))
The Spark documentation for withColumn says:

Returns a new Dataset by adding a column or replacing the existing column that has the same name.
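
That is the key point: DataFrames are immutable, so withColumn never modifies the receiver in place; it returns a new DataFrame that you have to capture. A minimal sketch of the difference (assuming a local SparkSession; the column values are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("1", "a"), ("2", "b")).toDF("id", "description")

df.withColumn("hashed", lit("hash"))    // result discarded; df itself is unchanged
println(df.columns.mkString(", "))      // id, description

val df2 = df.withColumn("hashed", lit("hash"))
println(df2.columns.mkString(", "))     // id, description, hashed

This is exactly why df.as[MyData2] failed: the df you called it on still had only the id and description columns.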
A better version of your read function follows. Try to avoid null assignments and var, and the explicit return value binding is not really needed either:
def read(arg1: String, arg2: String): Dataset[MyData2] = {
  val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
  val obj2 = new Matcher("cbutrer383", "g567g4rwew")
  val obj3 = new Matcher("cbutrer383", "567yr45e45")
  Seq(obj1, obj2, obj3).toDF("id", "description")
    .withColumn("hashed", lit("hash"))
    .as[MyData2]
}
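
If you would rather stay in the typed API end to end, you can also map a Dataset[MyData1] straight into a Dataset[MyData2] instead of adding an untyped column. A sketch under the same assumptions as above (local SparkSession, and the constant "hash" standing in for a real hash):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Re-using the case classes from the question.
case class MyData1(id: String, description: String)
case class MyData2(id: String, description: String, hashed: String)

// Typed conversion: each MyData1 row becomes a MyData2 row.
def toMyData2(ds: Dataset[MyData1]): Dataset[MyData2] =
  ds.map(d => MyData2(d.id, d.description, "hash"))

val ds1: Dataset[MyData1] = Seq(MyData1("cbutrer383", "e8f8chsdfd")).toDS()
val ds2: Dataset[MyData2] = toMyData2(ds1)

The map version keeps the field names checked at compile time, at the cost of going through the case class encoder instead of staying in Catalyst column expressions.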