What is the output format of the ALS model's userFeatures and productFeatures in the MLlib library?

Question

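I have a ratings dataset in this form: (userId,itemId,rating)

I am trying to build a matrix factorization model with the ALS method, and I get the user latent features and product latent features with the following code: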

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTest {
  def main(args: Array[String]): Unit = {
    // Backslashes must be escaped inside a Scala string literal
    System.setProperty("hadoop.home.dir", "C:\\spark-1.5.1-bin-hadoop2.6\\winutil")
    val conf = new SparkConf().setAppName("test").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load and parse the data: one "user item rating" triple per line
    val data = sc.textFile("ratings.txt")
    val ratings = data.map(_.split(" ") match { case Array(user, item, rate) =>
      Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 30
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // productFeatures is a field, not a method, so it takes no parentheses.
    // Each record is printed as a raw (productId, Array[Double]) tuple.
    model.productFeatures.cache().collect.foreach(println)
  }
}

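I set the rank to 10, so the output of model.productFeatures should be an RDD[(Int, Array[Double])]. But the output looks wrong to me. It contains some strange characters (what are they?), and the number of array elements seems to differ from record to record. These should be the latent feature values, so every record should contain the same number of them, exactly equal to the rank, yet I do not see ten values. The output looks like this: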

(48791,7fea9bb7)
(48795,284b451d)
(48799,3d64767d)
(48803,2f812fc3)
(48807,49d3ea7)
(48811,768cf084)
(48815,6845b7b6)
(48819,4e9c724a)
(48823,23191538)
(48827,3200d90f)
(48831,77bd30fe)
(48839,5a1e0261)
(48843,31c56ccf)
(48855,5b90359)
(48863,1b9de9d0)
(48867,313afdc8)
(48871,2b834c34)
(48875,666d21d6)
(48891,12ca97a2)
(48907,74f8fc8e)
(48911,452becc9)
(48915,4a47062b)
(48919,c76ef46)
(48923,3f596eca)
(48927,258e904c)
(48939,570abc88)
(48947,6c3d75f0)
(48951,18667983)
(48955,493b9633)
(48959,4b579d60)

In matrix factorization we should construct two lower-dimensional matrices whose product gives (approximately) the rating matrix:

rating matrix = p * q(transpose),
p = user latent feature matrix,
q = product latent feature matrix.
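
As a rough illustration of the shapes involved, here is a minimal plain-Scala sketch (no Spark) that multiplies a small p by the transpose of a small q. The object name FactorizationShapes, the helper multiplyByTranspose, and all numeric values are hypothetical and chosen only for this example. Each row of p plays the role of one user's factor array and each row of q one product's factor array.

```scala
object FactorizationShapes {
  type Matrix = Array[Array[Double]]

  // Computes p * q^T, where each row of p and of q is one latent-factor vector.
  // The result has one row per user and one column per product.
  def multiplyByTranspose(p: Matrix, q: Matrix): Matrix =
    p.map(pRow => q.map(qRow => pRow.zip(qRow).map { case (a, b) => a * b }.sum))

  def main(args: Array[String]): Unit = {
    // Hypothetical factors with rank 2: 3 users and 2 products
    val p: Matrix = Array(Array(1.0, 0.0), Array(0.0, 2.0), Array(1.0, 1.0))
    val q: Matrix = Array(Array(3.0, 1.0), Array(2.0, 4.0))

    val r = multiplyByTranspose(p, q) // 3 x 2 matrix of reconstructed ratings
    r.foreach(row => println(row.mkString(" ")))
  }
}
```

With rank 2, p is 3 x 2 and q is 2 x 2, so the reconstructed matrix is 3 x 2: one predicted rating per (user, product) pair.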

Can anyone explain the output format of the ALS method in Spark?

Answer 1

To see the latent factors of each product, use the following syntax:

model.productFeatures.collect().foreach{case (productID,latentFactors) => println("proID:"+ productID + " factors:"+ latentFactors.mkString(",") )}

The result for the given dataset looks like this:

proID:1 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:2 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
proID:3 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:4 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953

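As you can see, each product has exactly 10 factors, which is the correct number given the parameter val rank = 10.

To answer your second question: after training the model you have access to two variables, userFeatures: RDD[(Int, Array[Double])] and productFeatures: RDD[(Int, Array[Double])]. Each entry of the user-item matrix is computed as the dot product of the corresponding user vector and product vector. For example, the source code of the predict method shows how these variables are used to predict a specific user's rating for a product: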
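
The values in the question's output look like the hexadecimal identity hashes the JVM prints for an array's default toString (normally shown as [D@ followed by the hash), which would explain why they vary in length and are not ten numbers. The plain-Scala sketch below (no Spark) contrasts the two ways of printing; the object name PrintFactors, the helper format, and the sample values are hypothetical.

```scala
object PrintFactors {
  // Formats one (id, factors) record the same way as the snippet above
  def format(id: Int, factors: Array[Double]): String =
    "proID:" + id + " factors:" + factors.mkString(",")

  def main(args: Array[String]): Unit = {
    val record = (1, Array(0.5, -1.25, 2.0)) // hypothetical factor values

    // Default toString of Array[Double]: "[D@" plus a hash, not the contents
    println(record)

    // mkString joins the actual elements into readable text
    println(format(record._1, record._2)) // proID:1 factors:0.5,-1.25,2.0
  }
}
```

The hash printed by the first println changes between runs, so it carries no information about the factors themselves.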

def predict(user: Int, product: Int): Double = {
     val userVector = userFeatures.lookup(user).head
     val productVector = productFeatures.lookup(product).head
     blas.ddot(rank, userVector, 1, productVector, 1)
}
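
The same computation can be sketched without Spark or BLAS. In the minimal example below, plain Maps stand in for the two feature RDDs after they have been collected locally. The object name PredictSketch, the ids, and the factor values are hypothetical, and the dot product is written out by hand rather than calling blas.ddot.

```scala
object PredictSketch {
  // Hypothetical stand-ins for userFeatures and productFeatures (rank 3)
  val userFeatures: Map[Int, Array[Double]] = Map(1 -> Array(1.0, 2.0, 3.0))
  val productFeatures: Map[Int, Array[Double]] = Map(7 -> Array(0.5, 0.0, -1.0))

  // Predicted rating = dot product of the user's and the product's factors
  def predict(user: Int, product: Int): Double = {
    val userVector = userFeatures(user)
    val productVector = productFeatures(product)
    require(userVector.length == productVector.length, "vectors must share the same rank")
    userVector.zip(productVector).map { case (a, b) => a * b }.sum
  }

  def main(args: Array[String]): Unit = {
    println(predict(1, 7)) // 1.0*0.5 + 2.0*0.0 + 3.0*(-1.0) = -2.5
  }
}
```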

apache-spark

rdd

apache-spark-mllib