Spark UnaryTransformer implementation fails with scala.MatchError
I'm implementing a UnaryTransformer in Spark 1.6.2, using this interface:
class myUT(override val uid: String) extends UnaryTransformer[Seq[String], Seq[String], myUT] {
  ...
  // append "s" to every element of the input sequence
  override protected def createTransformFunc: Seq[String] => Seq[String] = {
    _.map(x => x + "s")
  }
This compiles fine, but at runtime it fails with:
17/07/21 22:29:33 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, myhost.com.au): scala.MatchError: ArrayBuffer(<contents of my array>) (of class scala.collection.mutable.ArrayBuffer)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$$anonfun$apply.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$$anonfun$apply.apply(basicOperators.scala:49)
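For reference, a minimal driver along these lines should reproduce the failure (the SparkContext value sc and the column names are my assumptions, not from the original post):

// Hypothetical repro; "words" and "out" are made-up column names.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = Seq((0, Seq("apple", "pear"))).toDF("id", "words")

new myUT("myUT")
  .setInputCol("words")
  .setOutputCol("out")
  .transform(df)
  .show()   // the scala.MatchError surfaces here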
The next thing I tried was replacing

_.map(x => x + "s")

with

x => x

so in theory there should be no change to the data at all! But the error I got was:
17/07/21 22:11:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, myhost.com.au): scala.MatchError: WrappedArray(<contents of my array>) (of class scala.collection.mutable.WrappedArray$ofRef)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$$anonfun$apply.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$$anonfun$apply.apply(basicOperators.scala:49)
So it looks like the type of the outgoing data gets changed no matter what. How can I avoid this?
UPDATE: The next thing I tried was adding .toArray to the map call. Now the error is:
[error] /sparkprj/src/main/scala/sp_txt.scala:43: polymorphic expression cannot be instantiated to expected type;
[error] found : [B >: String]Array[B]
[error] required: Seq[String]
[error] ).toArray
That may add some detail, but it doesn't add to my understanding. After looking at several mllib UnaryTransformer examples, I'm inclined to think this is a bug in Catalyst.
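As an aside, that compile error is plain Scala type inference at work: toArray is polymorphic in its element type, and the compiler cannot instantiate [B >: String]Array[B] to the expected Seq[String] on its own. Pinning the element type should satisfy the compiler, though it does nothing about the runtime MatchError. An illustrative sketch:

// Illustrative only: pinning toArray's type parameter makes it compile,
// but does not address the underlying MatchError.
override protected def createTransformFunc: Seq[String] => Seq[String] = {
  _.map(x => x + "s").toArray[String].toSeq
}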
The incorrect line was in the myUT class definition. It needed to be:

override protected def outputDataType: DataType = new ArrayType(StringType, true)

When I copied this class definition from a String->String transformer, I had left the DataType defined as StringType. My bad.
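That also explains the stack trace: Catalyst chooses its row converter from the declared outputDataType, so with StringType declared it applied StringConverter to a WrappedArray/ArrayBuffer and hit the MatchError. For completeness, a minimal sketch of the corrected transformer (the no-arg constructor and the validateInputType check are my additions, not from the original post):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class myUT(override val uid: String)
    extends UnaryTransformer[Seq[String], Seq[String], myUT] {

  def this() = this(Identifiable.randomUID("myUT"))

  // append "s" to every element of the input sequence
  override protected def createTransformFunc: Seq[String] => Seq[String] =
    _.map(x => x + "s")

  // the input column must be an array of strings
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType.isInstanceOf[ArrayType],
      s"Input type must be ArrayType(StringType) but got $inputType")

  // the fix: the output is Seq[String], so declare ArrayType(StringType),
  // not the StringType left over from the String->String transformer
  override protected def outputDataType: DataType = new ArrayType(StringType, true)
}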