What is the standard way of computing centroids when you have unknown data?
I need to compute the centroid of user ratings. My data is stored in a matrix like the one below (say we have 4 users and 12 ratings):
[[0,1,0,-1,0,2,3,4,1,0,0,0],
[0,1,1,-1,0,2,3,4,1,0,2,0],
[0,1,0,0,-1,2,3,4,1,0,0,0],
[0,1,-1,2,0,2,3,4,1,4,-1,-1]]
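My problem is that I'm not sure how to handle unknown data, i.e. when a user has not rated everything (in my example those values are initialized to -1). Here, 0 means the user did not like the object at all and 4 means they liked it. What should I do with the values equal to -1 when computing the centroid? Right now, my Python code looks like this: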
def calc_centroid(ratMatrix):
    centroid = [0 for x in range(len(ratMatrix[0]))]
    for i in range(len(ratMatrix)):
        for j in range(len(ratMatrix[i])):
            centroid[j] = centroid[j] + ratMatrix[i][j]
    count = len(ratMatrix[0])
    for i in range(len(centroid)):
        centroid[i] = centroid[i]*1.0/count
    return centroid
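However, I haven't accounted for the fact that the centroid is also computed using the -1 values, which I suspect is not quite right. What is the standard way to do this?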
I'm assuming the centroid is the mean. For an object where all 4 ratings are 1, your code returns 0.33. I think it should be 1.
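As a quick check of where the 0.33 seems to come from: the function divides each column sum by `len(ratMatrix[0])`, which is the number of columns (12), not the number of users (4). The variable names below are only illustrative and are not from the original code.

```python
# Second column of the example matrix: all four users rated the object 1.
column = [1, 1, 1, 1]

n_items = 12  # len(ratMatrix[0]): number of columns, used as the divisor in the question
n_users = 4   # len(ratMatrix): number of rows, i.e. users

print(round(sum(column) / n_items, 2))  # 0.33, what the question's code returns
print(sum(column) / n_users)            # 1.0, the mean over the four users
```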
numpy can do a few things to make this tidier.
import numpy as np

# The function from the question, unchanged, for comparison.
def calc_centroid(ratMatrix):
    centroid = [0 for x in range(len(ratMatrix[0]))]
    for i in range(len(ratMatrix)):
        for j in range(len(ratMatrix[i])):
            centroid[j] = centroid[j] + ratMatrix[i][j]
    count = len(ratMatrix[0])  # note: number of columns (12), not users (4)
    for i in range(len(centroid)):
        centroid[i] = centroid[i]*1.0/count
    return centroid

# Mean rating per object, ignoring unrated (-1) entries.
def calc_centroid2(ratMatrix):
    mean_ratings = []
    for i in range(ratMatrix.shape[1]):  # iterate columns
        col = ratMatrix[:, i]
        col = col[col != -1]  # exclude unrated
        mean_ratings.append(np.mean(col))
    return mean_ratings

# 4 users, 12 objects to rate: want the mean rating for each object.
ratMatrix = np.array([[0, 1,  0, -1,  0, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1,  1, -1,  0, 2, 3, 4, 1, 0,  2,  0],
                      [0, 1,  0,  0, -1, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1, -1,  2,  0, 2, 3, 4, 1, 4, -1, -1]])
print(ratMatrix)
centroids = calc_centroid(ratMatrix)
print(['{:.2f} '.format(i) for i in centroids])
centroids = calc_centroid2(ratMatrix)
print(['{:.2f} '.format(i) for i in centroids])
This produces
[[ 0  1  0 -1  0  2  3  4  1  0  0  0]
 [ 0  1  1 -1  0  2  3  4  1  0  2  0]
 [ 0  1  0  0 -1  2  3  4  1  0  0  0]
 [ 0  1 -1  2  0  2  3  4  1  4 -1 -1]]
['0.00 ', '0.33 ', '0.00 ', '0.00 ', '-0.08 ', '0.67 ', '1.00 ', '1.33 ', '0.33 ', '0.33 ', '0.08 ', '-0.08 ']
['0.00 ', '1.00 ', '0.33 ', '1.00 ', '0.00 ', '2.00 ', '3.00 ', '4.00 ', '1.00 ', '1.00 ', '0.67 ', '0.00 ']
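The same idea (average only the entries that were actually rated) can also be written without the explicit column loop. The following is a minimal sketch, not part of the original answer: the function name `calc_centroid_masked` and the `missing` parameter are made up for illustration, and returning NaN for an object nobody rated is an assumption about what you would want in that case.

```python
import numpy as np

def calc_centroid_masked(rat_matrix, missing=-1):
    """Column-wise mean over rated entries only; NaN where nobody rated."""
    ratings = np.asarray(rat_matrix, dtype=float)
    rated = ratings != missing                       # True where a rating exists
    totals = np.where(rated, ratings, 0.0).sum(axis=0)
    counts = rated.sum(axis=0)                       # ratings available per object
    # Divide only where at least one rating exists; other slots keep NaN.
    return np.divide(totals, counts,
                     out=np.full(ratings.shape[1], np.nan),
                     where=counts > 0)

rat_matrix = [[0, 1,  0, -1,  0, 2, 3, 4, 1, 0,  0,  0],
              [0, 1,  1, -1,  0, 2, 3, 4, 1, 0,  2,  0],
              [0, 1,  0,  0, -1, 2, 3, 4, 1, 0,  0,  0],
              [0, 1, -1,  2,  0, 2, 3, 4, 1, 4, -1, -1]]
centroid = calc_centroid_masked(rat_matrix)
print(np.round(centroid, 2))
```

For the example matrix this should give the same values as the second line of output above. Unlike `calc_centroid2`, it does not call `np.mean` on an empty slice when a column contains only -1.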