MongoSpark - convert bson Document to Map[String, Double]

Question

In my MongoDB database I have a collection of the following documents:

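As you can see, each document has some nested documents (Decade, Title, Plot, Genres, etc.). These are Map representations of SparseVectors that I came up with, and they are actually generated by my other Spark job.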

As it appears, documents like these can't easily be read into a Spark DataFrame.

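I was wondering how I can actually read such documents into a DataFrame where each sub-document is represented not by a bson Document but by a simple Map[String, Double], because each of those sub-documents is completely arbitrary and contains an arbitrary number of numeric fields.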

Is there a way to deal with such documents?

Answer 1

Managed to solve it. Here's how:

import com.mongodb.spark.MongoSpark
import org.bson.Document
import spark.implicits._
final case class MovieData(imdbID: Int,
                       Title: Map[Int, Double],
                       Decade: Map[Int, Double],
                       Plot: Map[Int, Double],
                       Genres: Map[Int, Double],
                       Actors: Map[Int, Double],
                       Countries: Map[Int, Double],
                       Writers: Map[Int, Double],
                       Directors: Map[Int, Double],
                       Productions: Map[Int, Double]
                      )

val movieDataDf = MongoSpark
  .load(sc, moviesDataConfig).rdd.map((doc: Document) => {
    MovieData(
      doc.get("imdbID").asInstanceOf[Int],
      documentToMap(doc.get("Title").asInstanceOf[Document]),
      documentToMap(doc.get("Decade").asInstanceOf[Document]),
      documentToMap(doc.get("Plot").asInstanceOf[Document]),
      documentToMap(doc.get("Genres").asInstanceOf[Document]),
      documentToMap(doc.get("Actors").asInstanceOf[Document]),
      documentToMap(doc.get("Countries").asInstanceOf[Document]),
      documentToMap(doc.get("Writers").asInstanceOf[Document]),
      documentToMap(doc.get("Directors").asInstanceOf[Document]),
      documentToMap(doc.get("Productions").asInstanceOf[Document])
    )
}).toDF()

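// Converts a sub-document whose keys are stringified vector indices into Map[Int, Double].
// Note: getDouble throws a ClassCastException if a value is not stored as a Double.
def documentToMap(doc: Document): Map[Int, Double] = {
  doc.keySet().toArray.map(key => {
    (key.toString.toInt, doc.getDouble(key).toDouble)
  }).toMap
}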

Hopefully the code is self-explanatory. Some type casting and conversions did the job. Probably not the most elegant solution.
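If some sub-documents can be missing, or some values were written as integers rather than doubles, a more defensive conversion may help. The following is only a minimal sketch under those assumptions, not part of the original solution: the names SubDocumentConversion and toSparseMap are hypothetical, and the function is written against java.util.Map (which org.bson.Document implements) so that it runs without the MongoDB driver on the classpath.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
import scala.collection.JavaConverters._

// Hypothetical helper, shown for illustration only.
object SubDocumentConversion {

  // Converts a sub-document into Map[Int, Double].
  // - a null sub-document yields an empty map
  // - any java.lang.Number (Integer, Long, Double, ...) is widened to Double
  // - non-numeric values are skipped
  // Keys that are not valid integers still throw NumberFormatException.
  def toSparseMap(sub: JMap[String, AnyRef]): Map[Int, Double] =
    Option(sub)
      .map(_.asScala.toSeq)
      .getOrElse(Seq.empty)
      .collect { case (key, n: java.lang.Number) => key.toInt -> n.doubleValue() }
      .toMap

  def main(args: Array[String]): Unit = {
    // Stand-in for a sub-document such as doc.get("Title").
    val sub = new JLinkedHashMap[String, AnyRef]()
    sub.put("3", java.lang.Double.valueOf(0.5))
    sub.put("17", java.lang.Integer.valueOf(2)) // value stored as an integer
    sub.put("42", "not a number")               // skipped by the conversion

    println(toSparseMap(sub))
    println(toSparseMap(null))
  }
}
```

Since Document is itself a java.util.Map[String, Object], a call such as toSparseMap(doc.get("Title").asInstanceOf[Document]) should work as a replacement for documentToMap in the snippet above, with the same case class.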

mongodb

subdocument

apache-spark

spark-dataframe