Calculate Cosine Similarity on a Spark DataFrame
I am using Spark with Scala to calculate the cosine similarity between the rows of a DataFrame.
The DataFrame has the following schema:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame is shown below:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
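I tried transposing the matrix and looked at the following links: Apache Spark Python Cosine Similarity over DataFrames, and calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf. However, I believe there is a better solution.
I tried the sample code below: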
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I get the following error:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link, but could not get it to work in Scala.
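- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match on the Row or extract the parts you are interested in directly.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it before it can be used with IndexedRowMatrix.
- The index has to be a Long, not a string.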
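import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.Row

// Match each Row, convert the ml Vector to an mllib Vector, then attach a Long index
val irm = new IndexedRowMatrix(inClusters.rdd.map {
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })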
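To get from the IndexedRowMatrix to actual row-to-row similarities, something along the following lines might work. This is only a sketch under a few assumptions: the object name CosineSimilaritySketch, the helper pairwiseCosine, the local master setting and the three-row toy DataFrame are all made up for illustration; the SKU column is assumed to be a non-null double; and the index-to-SKU lookup is collected to the driver, which is only reasonable when the number of rows is small. The transpose step comes from the attempt in the question, since columnSimilarities compares columns rather than rows.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object CosineSimilaritySketch {

  // Hypothetical helper: returns (skuA, skuB, cosine) triples.
  // Assumes df has columns (SKU: double, Features: ml vector), both non-null.
  def pairwiseCosine(df: DataFrame): Array[(Double, Double, Double)] = {
    // Keep the SKU next to each converted vector and give every row a Long index.
    val indexed = df.rdd.map {
      case Row(sku: Double, v: MLVector) => (sku, OldVectors.fromML(v))
    }.zipWithIndex().cache()

    val irm = new IndexedRowMatrix(indexed.map { case ((_, v), i) => IndexedRow(i, v) })

    // columnSimilarities works on columns, so transpose first:
    // the original rows become the columns that get compared.
    val sims = irm.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

    // Lookup from row index back to SKU, collected to the driver
    // (assumption: the number of rows is small enough for this).
    val skuByIndex = indexed.map { case ((sku, _), i) => (i, sku) }.collectAsMap()

    sims.entries.collect().map {
      case MatrixEntry(i, j, s) => (skuByIndex(i), skuByIndex(j), s)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("cosine-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up data with the same schema as inClusters.
    val inClusters = Seq(
      (1.0, MLVectors.dense(1.0, 2.0)),
      (2.0, MLVectors.dense(2.0, 4.0)),
      (3.0, MLVectors.dense(2.0, 1.0))
    ).toDF("SKU", "Features")

    pairwiseCosine(inClusters).foreach(println)
    spark.stop()
  }
}
```

For the toy data, SKUs 1.0 and 2.0 point in the same direction, so their similarity should come out at roughly 1.0, and each of them should score roughly 0.8 against SKU 3.0. Note that columnSimilarities returns only the upper-triangular part, so each pair appears once, and pairs whose similarity is zero may be missing from the output entirely.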