ALS model - how to generate full_u * v^t * v?

Question

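I am trying to figure out how an ALS model can predict values for new users in between batch updates. In my search I came across this Stack Overflow answer. For the reader's convenience, I have copied the answer below: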

You can get predictions for new users using the trained model (without updating it):

To get predictions for a user in the model, you use its latent representation (vector u of size f (number of factors)), which is multiplied by the product latent factor matrix (matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M (number of different products)), but what you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.

i.e. if your user latent matrix is u and your product latent matrix is v, for user i in the model, you get scores by doing: u_i * v. For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v. This will approximate the latent factors for the new users and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).

To answer the question of training, this allows you to compute predictions for new users without having to do the heavy computation of the model, which you can now do only once in a while. So you have your batch processing at night and can still make predictions for new users during the day.

Note: MLlib gives you access to the matrices u and v

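The text quoted above is a good answer, but I am having trouble understanding how to implement this solution programmatically. For example, the matrices u and v can be obtained with: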

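# pyspark example
from pyspark.mllib.recommendation import ALS

# omitted for brevity ... loading movielens 1M ratings

# ratings, rank, numIterations and lambdaParam come from the omitted setup
model = ALS.train(ratings, rank, numIterations, lambdaParam)

# RDD of (user id, latent factor array) pairs
matrix_u = model.userFeatures()

print(matrix_u.take(2)) # take a look at the dataset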

This returns:

[
  (2, array('d', [0.26341307163238525, 0.1650490164756775, 0.118405282497406, -0.5976635217666626, -0.3913084864616394, -0.1379186064004898, -0.3866392970085144, -0.1768060326576233, -0.38342711329460144, 0.48550787568092346, -0.18867433071136475, -0.02757863700389862, 0.1410026103258133, 0.11498363316059113, 0.03958914801478386, 0.034536730498075485, 0.08427099883556366, 0.46969038248062134, -0.8230801224708557, -0.15124185383319855, 0.2566414773464203, 0.04326820373535156, 0.19077207148075104, 0.025207923725247383, -0.02030213735997677, 0.1696728765964508, 0.5714617967605591, -0.03885050490498543, -0.09797532111406326, 0.29186877608299255, -0.12768596410751343, -0.1582849770784378, 0.01933656632900238, -0.09131495654582977, 0.26577943563461304, -0.4543033838272095, -0.11789630353450775, 0.05775507912039757, 0.2891307771205902, -0.2147761881351471, -0.011787488125264645, 0.49508437514305115, 0.5610293745994568, 0.228189617395401, 0.624510645866394, -0.009683617390692234, -0.050237834453582764, -0.07940001785755157, 0.4686132073402405, -0.02288617007434368])), 
  (4, array('d', [-0.001666820957325399, -0.12487432360649109, 0.1252429485321045, -0.794727087020874, -0.3804478347301483, -0.04577340930700302, -0.42346617579460144, -0.27448347210884094, -0.25846347212791443, 0.5107921957969666, 0.04229479655623436, -0.10212298482656479, -0.13407345116138458, -0.2059325873851776, 0.12777331471443176, -0.318756639957428, 0.129398375749588, 0.4351944327354431, -0.9031049013137817, -0.29211774468421936, -0.02933369390666485, 0.023618215695023537, 0.10542935132980347, -0.22032295167446136, -0.1861676126718521, 0.13154461979866028, 0.6130356192588806, -0.10089754313230515, 0.13624103367328644, 0.22037173807621002, -0.2966669499874115, -0.34058427810668945, 0.37738317251205444, -0.3755438029766083, -0.2408779263496399, -0.35355791449546814, 0.05752146989107132, -0.15478627383708954, 0.3418906629085541, -0.6939512491226196, 0.4279302656650543, 0.4875738322734833, 0.5659542083740234, 0.1479463279247284, 0.5280753970146179, -0.24357643723487854, 0.14329688251018524, -0.2137598991394043, 0.011986476369202137, -0.015219110995531082]))
]

I can do something similar to get the v matrix:

matrix_v = model.productFeatures()

print(matrix_v.take(2)) # take a look at the dataset

This results in:

[
  (2, array('d', [0.019985994324088097, 0.0673416256904602, -0.05697149783372879, -0.5434763431549072, -0.40705952048301697, -0.18632276356220245, -0.30776089429855347, -0.13178342580795288, -0.27466219663619995, 0.4183739423751831, -0.24422742426395416, -0.24130797386169434, 0.24116989970207214, 0.06833088397979736, -0.01750543899834156, 0.03404173627495766, 0.04333991929888725, 0.3577033281326294, -0.7044714689254761, 0.1438472419977188, 0.06652364134788513, -0.029888223856687546, -0.16717877984046936, 0.1027146726846695, -0.12836599349975586, 0.10197233408689499, 0.5053384900093079, 0.019304445013403893, -0.21254844963550568, 0.2705852687358856, -0.04169371724128723, -0.24098040163516998, -0.0683765709400177, -0.09532768279314041, 0.1006036177277565, -0.08682398498058319, -0.13584329187870026, -0.001340558985248208, 0.20587041974067688, -0.14007550477981567, -0.1831497997045517, 0.5021498203277588, 0.3049483597278595, 0.11236990243196487, 0.15783801674842834, -0.044139936566352844, -0.14372406899929047, 0.058535050600767136, 0.3777201473712921, -0.045475270599126816])), 
  (4, array('d', [0.10334215313196182, 0.1881643384695053, 0.09297363460063934, -0.457258403301239, -0.5272660255432129, -0.0989445373415947, -0.2053477019071579, -0.1644461452960968, -0.3771175146102905, 0.21405018866062164, -0.18553146719932556, 0.011830524541437626, 0.29562288522720337, 0.07959598302841187, -0.035378433763980865, -0.11786794662475586, -0.11603366583585739, 0.3776192367076874, -0.5124108791351318, 0.03971947357058525, -0.03365595266222954, 0.023278912529349327, 0.17436474561691284, -0.06317273527383804, 0.05118614062666893, 0.4375131130218506, 0.3281322419643402, 0.036590900272130966, -0.3759073317050934, 0.22429685294628143, -0.0728025734424591, -0.10945595055818558, 0.0728464275598526, 0.014129920862615108, -0.10701996833086014, -0.2496117204427719, -0.09409723430871964, -0.11898282915353775, 0.18940524756908417, -0.3211393356323242, -0.035668935626745224, 0.41765937209129333, 0.2636736035346985, -0.01290816068649292, 0.2824321389198303, 0.021533429622650146, -0.08053319901227951, 0.11117415875196457, 0.22975310683250427, 0.06993964314460754]))
]

However, I am not sure how to get from this to full_u * v^t * v.

Answer 1

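The new user is not in the matrix U, so you don't have its latent representation in 'k' factors; you only know its full representation, i.e. all of its ratings. full_u here means all of the new user's ratings in a dense format (rather than the sparse format of ratings), for example: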

[0 2 0 0 0 1 0] if user u has rated item 2 with a 2 and item 6 with a 1.
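Going from sparse (product id, rating) pairs to that dense row can be sketched roughly as follows. This is only an illustration: product_ids, new_user_ratings and to_dense are hypothetical names, and the one thing that really matters is that the columns of full_u follow the same product order as the rows of the product factor matrix.

```python
import numpy as np

# Hypothetical data: product ids in the order their factor rows appear in V.
product_ids = [1, 2, 3, 4, 5, 6, 7]
# Sparse ratings of the new user as (product_id, rating) pairs.
new_user_ratings = [(2, 2.0), (6, 1.0)]


def to_dense(ratings, product_ids):
    """Return a 1 x M row vector with each rating placed in its product's column."""
    column_of = {pid: col for col, pid in enumerate(product_ids)}
    full_u = np.zeros((1, len(product_ids)))
    for pid, rating in ratings:
        full_u[0, column_of[pid]] = rating
    return full_u


full_u = to_dense(new_user_ratings, product_ids)
print(full_u)
```

Unrated products stay at 0, which matches the example vector above.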

Then you can get v just as you did, and turn it into a matrix in numpy, for example:

import numpy as np
pf = model.productFeatures()
# one row per product, one column per latent factor
Vt = np.matrix(np.asarray(pf.values().collect()))

Then it is just a matter of multiplication: full_u*Vt*Vt.T
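A small NumPy-only sketch of that multiplication, with made-up numbers standing in for what model.productFeatures().collect() would return. One assumption worth flagging: the order of a collected RDD is not tied to the product ids, so this sketch keeps the ids and sorts by them to keep the columns of full_u aligned with the rows of the factor matrix. The M x f matrix is called V here (it is the one named Vt in this answer).

```python
import numpy as np

# Stand-in for model.productFeatures().collect(): (product_id, factors) pairs.
# The numbers are invented for illustration; rank f = 2, M = 3 products.
product_features = [
    (30, [0.0, 1.0]),
    (10, [1.0, 0.0]),
    (20, [0.5, 0.5]),
]

# Sort by product id so that row j of V belongs to a known product.
product_features.sort(key=lambda pair: pair[0])
product_ids = [pid for pid, _ in product_features]
V = np.array([factors for _, factors in product_features])  # shape (M, f)

# Dense ratings of the new user, columns in the same order as product_ids.
full_u = np.array([[2.0, 0.0, 0.0]])  # rated product 10 with a 2

latent_u = full_u @ V    # shape (1, f): approximate latent factors
scores = latent_u @ V.T  # shape (1, M): one score per product
print(product_ids, scores)
```

The row of scores is read against product_ids to find which products to recommend.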

Compared with the other answer, Vt and V are swapped, but that is only a matter of convention.

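Note that the product Vt*Vt.T is fixed, so if you plan to use it for several new users, it is computationally more efficient to precompute it. In fact, for more than one user it is better to put all of their ratings into bigU (in the same format as my single new user example) and then just do the matrix product bigU*Vt*Vt.T to get all the ratings for all the new users. It may still be worth checking that the product is carried out in the most efficient way in terms of the number of operations.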
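On the operation count: matrix multiplication is associative, so the two groupings below give the same scores but cost different amounts of work. This is a rough sketch with random stand-in data and made-up sizes, not a benchmark; V is again the M x f product factor matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_new, M, f = 4, 1000, 10  # made-up sizes: new users, products, factors
V = rng.normal(size=(M, f))  # stand-in for the product factors
bigU = rng.integers(0, 6, size=(n_new, M)).astype(float)  # stand-in ratings

# Option A: precompute the fixed M x M product once and reuse it.
VVt = V @ V.T          # about M*M*f multiplications, M*M numbers in memory
scores_a = bigU @ VVt  # about n_new*M*M multiplications per batch

# Option B: go through the f-dimensional latent space.
scores_b = (bigU @ V) @ V.T  # about 2*n_new*M*f multiplications per batch

print(np.allclose(scores_a, scores_b))
```

When f is much smaller than M, as is usual for ALS, option B needs fewer operations per batch and never stores an M x M matrix, so precomputing only pays off if that matrix is small enough to keep around.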

Answer 2

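Just a word of caution. People talk about the user and product matrices as if they were left and right singular vectors. But as far as I understand, the method used to find U and V is a straight optimization of a squared-error cost function, which comes with none of the orthogonality guarantees of SVD.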

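In other words, think about the claim of the answer above algebraically. We have a full ratings matrix R, an n by p matrix holding the ratings of n users for p products. We factor it with k latent factors to approximate R = UV, where the rows of U, an n x k matrix, are the latent user representations, and the columns of V, a k x p matrix, are the latent product representations. To find the latent user representation for a matrix R of entirely new users without refitting the model, we need: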

       R = U V  
R V^{-1} = U V V^{-1}  
R V^{-1} = U I_{k}  
R V^{-1} = U

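where I_{k} is the k-dimensional identity matrix and V^{-1} is the p x k right inverse of V. The hint above assumes that V^{T} = V^{-1}. This is not guaranteed. In general, there is no guarantee that assuming it holds will give you anything but a nonsensical answer.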
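To make the algebra concrete, here is a small numerical sketch. It swaps in the Moore-Penrose pseudo-inverse (np.linalg.pinv) as the right inverse, which is my own substitution and not something either answer or MLlib does. It also assumes a random V with full row rank and an R that is exactly U V, which real, mostly missing rating data will not satisfy.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 3, 8
V = rng.normal(size=(k, p))  # k x p product factors, random stand-in

# Does V have orthonormal rows, i.e. is V V^T the identity? Not in general.
print(np.allclose(V @ V.T, np.eye(k)))

# The pseudo-inverse is a p x k right inverse when V has full row rank.
V_pinv = np.linalg.pinv(V)
print(np.allclose(V @ V_pinv, np.eye(k)))

# Two new users whose ratings are exactly U_true V (an idealized case).
U_true = rng.normal(size=(2, k))
R = U_true @ V

U_transpose = R @ V.T  # the shortcut: treat V^T as the inverse
U_pinv = R @ V_pinv    # use an actual right inverse instead

print(np.abs(U_transpose - U_true).max())
print(np.abs(U_pinv - U_true).max())
```

In this idealized setup the pseudo-inverse recovers the latent factors up to rounding, while the transpose shortcut returns U_true multiplied by V V^T, which only matches when the rows of V happen to be orthonormal.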

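Please let me know if I have missed something in the optimization method behind MLlib's CF implementation. Is there a trick in the ALS model that I am missing that guarantees orthogonality?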

apache-spark

apache-spark-ml

apache-spark-mllib