Spark - How to create a sparse matrix from item ratings

Question

My question is equivalent to the R-related post "Create Sparse Matrix from a data frame", except that I would like to perform the same operation on Spark (preferably in Scala).

Sample of the data in the data.txt file from which the sparse matrix should be created:

UserID MovieID  Rating
2      1       1
3      2       1
4      2       1
6      2       1
7      2       1

So in the end the columns are the movie IDs and the rows are the user IDs:

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   1   0   0   0   0   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   1   0   0   0   0   0
7   0   1   0   0   0   0   0

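I have actually started with a map RDD transformation on the data.txt file (without headers) to convert the values to integers, but then... I could not find a function for creating the sparse matrix.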

import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/data/data.txt")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toInt)
  })
...?

Answer 1

The simplest way is to map the Ratings to MatrixEntries and create a CoordinateMatrix:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val mat = new CoordinateMatrix(ratings.map {
    case Rating(user, movie, rating) => MatrixEntry(user, movie, rating)
})
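One detail worth spelling out: matrix indices are zero-based, and when no dimensions are given the size is inferred as the largest index plus one. With the sample data that yields fewer columns than the grid in the question shows. Below is a minimal, self-contained sketch that passes the dimensions explicitly. The object name `RatingsMatrixSketch`, the helper `toMatrix`, the local master, and the 8 x 8 size (so that IDs 1 to 7 fit as zero-based indices) are all assumptions made for illustration, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

object RatingsMatrixSketch {
  // Hypothetical helper: one MatrixEntry per rating, with explicit dimensions.
  def toMatrix(ratings: RDD[Rating], nRows: Long, nCols: Long): CoordinateMatrix =
    new CoordinateMatrix(
      ratings.map { case Rating(user, movie, rating) => MatrixEntry(user, movie, rating) },
      nRows,
      nCols)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master for experimentation; in spark-shell reuse the provided `sc`.
    val sc = new SparkContext(
      new SparkConf().setAppName("ratings-matrix-sketch").setMaster("local[*]"))
    try {
      // The sample rows from the question, inlined so no input file is needed.
      val ratings = sc.parallelize(Seq(
        Rating(2, 1, 1.0), Rating(3, 2, 1.0), Rating(4, 2, 1.0),
        Rating(6, 2, 1.0), Rating(7, 2, 1.0)))

      // Assumption: 8 x 8, so that user and movie IDs up to 7 are valid indices.
      val mat = toMatrix(ratings, 8L, 8L)

      // Only the non-zero cells are stored as entries.
      println(s"${mat.numRows()} x ${mat.numCols()}, ${mat.entries.count()} stored entries")
    } finally {
      sc.stop()
    }
  }
}
```

If the IDs should map to a 7 x 7 matrix instead, subtracting 1 from both IDs before building each MatrixEntry is one option.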

A CoordinateMatrix can be further converted to a BlockMatrix, IndexedRowMatrix, or RowMatrix using toBlockMatrix, toIndexedRowMatrix, and toRowMatrix, respectively.
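The conversions can be sketched as follows. This is an illustrative example only: the object name `MatrixConversionsSketch`, the helper `toLocal`, the local master, and the 8 x 8 dimensions are assumptions. Collecting with `toLocalMatrix` is used here just to print the small sample as a grid on the driver; it is not something to do with a large matrix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object MatrixConversionsSketch {
  // Hypothetical helper: collects a small matrix to the driver as a local matrix.
  def toLocal(mat: CoordinateMatrix): Matrix = mat.toBlockMatrix().toLocalMatrix()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master for experimentation.
    val sc = new SparkContext(
      new SparkConf().setAppName("matrix-conversions-sketch").setMaster("local[*]"))
    try {
      // Sample (user, movie, rating) triples from the question.
      val entries = sc.parallelize(Seq(
        MatrixEntry(2L, 1L, 1.0), MatrixEntry(3L, 2L, 1.0), MatrixEntry(4L, 2L, 1.0),
        MatrixEntry(6L, 2L, 1.0), MatrixEntry(7L, 2L, 1.0)))
      val mat = new CoordinateMatrix(entries, 8L, 8L)

      // IndexedRowMatrix: each row keeps its user index alongside a vector of ratings.
      val indexed = mat.toIndexedRowMatrix()
      indexed.rows.collect().sortBy(_.index).foreach { row =>
        println(s"user ${row.index}: ${row.vector}")
      }

      // RowMatrix: the same row vectors, but without the row indices.
      val rowMat = mat.toRowMatrix()
      println(s"RowMatrix with ${rowMat.numCols()} columns")

      // BlockMatrix collected locally, only because the sample is tiny.
      println(toLocal(mat))
    } finally {
      sc.stop()
    }
  }
}
```

Which target to pick depends on what comes next: IndexedRowMatrix when the user ID of each row still matters, RowMatrix when it does not, and BlockMatrix for operations such as matrix multiplication.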
