Spark 2.0 Dataset Encoder with trait

Question

I am building a Dataset where each record is mapped to a case class (e.g. CustomDataEntry, which holds primitive types).

val dataset = spark.read (...) .as[CustomDataEntry]

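So far so good.

Now I am writing a transformer that takes a Dataset of CustomDataEntry, does some computation and adds some new columns, e.g. it looks up the latitude and longitude and computes a geohash.

My CustomDataEntry now has a property/column (geohash) which is not present in the case class but is present in the Dataset. Again, this works fine, but it does not feel right and it is not type safe (if that is even possible with Encoders).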
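A minimal sketch of the situation described above, under stated assumptions: the fields `id`, `lat` and `lon` are hypothetical (the question only says the case class holds primitive types), and the "geohash" here is a placeholder string built from the coordinates, not a real geohash.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

// Hypothetical fields; the question does not list them.
case class CustomDataEntry(id: Long, lat: Double, lon: Double)

object UntypedColumnSketch {
  // Adds a "geohash" column. The value is a stand-in (lat:lon), NOT a real geohash.
  // withColumn returns an untyped DataFrame, so the static type is lost here.
  def addGeohash(ds: Dataset[CustomDataEntry]): DataFrame =
    ds.withColumn(
      "geohash",
      concat_ws(":", col("lat").cast("string"), col("lon").cast("string")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("untyped-column-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataset: Dataset[CustomDataEntry] =
      Seq(CustomDataEntry(1L, 52.52, 13.40)).toDS()

    // Going back to the case class view keeps the extra column in the schema,
    // but the CustomDataEntry type gives no typed access to it.
    val typedAgain: Dataset[CustomDataEntry] = addGeohash(dataset).as[CustomDataEntry]
    typedAgain.printSchema()

    spark.stop()
  }
}
```

The extra column can still be reached by name, for example with `typedAgain.select("geohash")`, but the compiler cannot check that access.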

I could add it as an Option field in my case class, but that seems messy and is not composable. A far better way, it seems, would be to mix in some trait on CustomDataEntry,

e.g.

trait Geo{
    val geohash:String
}

and then return the dataset as

dataset.as[CustomDataEntry with Geo]

This does not work:

Error:(21, 10) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.as[CustomDataEntry with Geo]

The answer seems obvious (not supported, future releases), but maybe I have overlooked something?

Answer 1

IMHO encoders for this are not there yet, but you can use Encoders.kryo[CustomDataEntry with Geo] as an encoder workaround.
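A minimal sketch of how this workaround might be applied, under stated assumptions: the fields `id`, `lat` and `lon`, the `GeoEntry` subclass and the `fakeGeohash` helper are all hypothetical names introduced for illustration, and `fakeGeohash` returns a placeholder rather than a real geohash. The encoder is passed to `map` explicitly so the example does not depend on implicit resolution.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical fields; the question only says the case class holds primitive types.
case class CustomDataEntry(id: Long, lat: Double, lon: Double)

trait Geo {
  val geohash: String
}

// Hypothetical named subclass that mixes in Geo.
// A plain (non-case) class is allowed to extend a case class.
class GeoEntry(id: Long, lat: Double, lon: Double, val geohash: String)
  extends CustomDataEntry(id, lat, lon) with Geo

object KryoEncoderSketch {
  // Placeholder only, NOT a real geohash algorithm.
  def fakeGeohash(lat: Double, lon: Double): String = s"$lat:$lon"

  // Kryo-based encoder for the compound type, as suggested in the answer.
  val geoEncoder: Encoder[CustomDataEntry with Geo] =
    Encoders.kryo[CustomDataEntry with Geo]

  // Builds the enriched objects in a typed map and hands the Kryo encoder over explicitly.
  def addGeo(ds: Dataset[CustomDataEntry]): Dataset[CustomDataEntry with Geo] =
    ds.map[CustomDataEntry with Geo] { e =>
      new GeoEntry(e.id, e.lat, e.lon, fakeGeohash(e.lat, e.lon))
    }(geoEncoder)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kryo-encoder-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataset: Dataset[CustomDataEntry] =
      Seq(CustomDataEntry(1L, 52.52, 13.40)).toDS()

    val withGeo = addGeo(dataset)
    withGeo.printSchema() // a Kryo-encoded Dataset has a single binary column
    withGeo.collect().foreach(e => println(s"${e.id} -> ${e.geohash}"))

    spark.stop()
  }
}
```

The trade-off is that a Kryo-encoded Dataset stores each object as one opaque binary value, so the individual fields are no longer visible as columns to Spark SQL. For the same reason the sketch creates the enriched objects inside `map` instead of calling `.as[CustomDataEntry with Geo]` on a Dataset that still has ordinary columns.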

scala

dataset

apache-spark