Printing a CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

Question

I am using Spark MLlib for one of my projects, in which I need to compute document similarities.

I first used MLlib's tf-idf transformation to convert the documents into vectors, then converted those into a RowMatrix and called the columnSimilarities() method.

I followed the tf-idf documentation and used the DIMSUM implementation of cosine similarity.

This is the Scala code I executed in spark-shell:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()

val tf = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm

val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix

Now suppose my input file, test1 in this code block, is a simple file containing 5 short documents (fewer than 10 terms each), one per line.

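Since I am just testing this code, I would like to see the output of mat.columnSimilarities(), which is held in the object sim. I would like to see the similarity of the first document vector with the second, the third, and so on.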

I referred to the Spark documentation for CoordinateMatrix, which is the type of object returned by the columnSimilarities method of the RowMatrix class and the type that sim refers to.

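From reading more of the documentation, I figured I could convert the CoordinateMatrix to a RowMatrix, convert the rows of the RowMatrix to an array, and then print it like this: println(sim.toRowMatrix().rows.toArray().mkString("\n")).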

But that gives some output that I cannot make sense of.

Can anyone help? Any kind of resource links etc. would help a lot!

Thanks!

Answer 1

You can try the following, with no need to convert to the row matrix format:

import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map { case MatrixEntry(row, col, value) => s"$row,$col,$value" } // one "row,col,similarity" string per entry

To retrieve the elements, you can invoke the following action:

transformedRDD.collect()
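
One point worth checking when reading the result: columnSimilarities() compares the columns of the matrix and returns a sparse upper-triangular matrix. In the question's mat, each row is a document and each column is a hashed term, so the entries of sim are term-to-term similarities rather than document-to-document ones. The following is a minimal sketch of one way to get document similarities instead, by transposing first. It assumes it is pasted into spark-shell (so sc already exists) and uses a toy 3 x 4 matrix in place of the tf-idf vectors; the names docs, indexed, transposed and docSim are illustrative and not from the original code.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}

// Assumption: toy stand-in for the tf-idf output, 3 documents (rows) x 4 terms (columns).
// Documents 0 and 2 are identical; document 1 shares no term with them.
val docs = sc.parallelize(Seq(
  Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0)),
  Vectors.sparse(4, Array(1, 3), Array(3.0, 1.0)),
  Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))
))

// Attach an explicit row index to every document so it survives the transpose.
val indexed = new IndexedRowMatrix(
  docs.zipWithIndex.map { case (vec, idx) => IndexedRow(idx, vec) }
)

// After transposing, documents are the columns, which is what columnSimilarities() compares.
val transposed = indexed.toCoordinateMatrix().transpose()
val docSim = transposed.toRowMatrix().columnSimilarities()

// Only pairs (i, j) with i < j and a non-zero similarity are stored.
// collect() pulls everything to the driver, so this is only suitable for small results.
docSim.entries.collect()
  .sortBy(e => (e.i, e.j))
  .foreach { case MatrixEntry(i, j, v) => println(s"doc $i ~ doc $j: $v") }
```

For a larger result, replacing collect() with take(n) keeps the amount of data brought back to the driver bounded.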

Tags: scala, apache-spark, apache-spark-mllib