Adjusted cosine similarity not working correctly
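I am developing an item-based collaborative filter between restaurants, using an adjusted cosine similarity to generate recommendations. I have everything set up and it runs fine, but when I tried to simulate possible test scenarios I got some interesting results.
I'll start with my test data. I have 2 restaurants between which I want to calculate the similarity, and 3 users who have each given both restaurants the same rating. I'll explain it with the following matrix: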
             | User 1 | User 2 | User 3
Restaurant 1 |   1    |   2    |   1
Restaurant 2 |   1    |   2    |   1
I'm trying to calculate the similarity with the following function (restaurants are called Subject in my code):
    public double ComputeSimilarity(Guid subject1, Guid subject2, IEnumerable<Review> allReviews)
    {
        //This will create an IEnumerable of reviews from the same user on the 2 restaurants.
        var matches = (from R1 in allReviews.Where(x => x.SubjectId == subject1)
                       from R2 in allReviews.Where(x => x.SubjectId == subject2)
                       where R1.UserId == R2.UserId
                       select new { R1, R2 });
        double num = 0.0f;
        double dem1 = 0.0f;
        double dem2 = 0.0f;
        //For the similarity between subjects, we use an adjusted cosine similarity.
        //More information on this can be found here: http://www10.org/cdrom/papers/519/node14.html
        foreach (var item in matches)
        {
            //First get the average of all reviews the user has given. This is used in the adjusted cosine similarity, read the article from the link for further explanation
            double avg = allReviews.Where(x => x.UserId == item.R1.UserId)
                                   .Average(x => x.rating);
            num += ((item.R1.rating - avg) * (item.R2.rating - avg));
            dem1 += Math.Pow((item.R1.rating - avg), 2);
            dem2 += Math.Pow((item.R2.rating - avg), 2);
        }
        return (num / (Math.Sqrt(dem1) * Math.Sqrt(dem2)));
    }
My Review class looks like this:
    public class Review
    {
        public Guid Id { get; set; }
        public int rating { get; set; } //This can be an integer between 1-5
        public Guid SubjectId { get; set; } //This is the guid of the subject the review has been left on
        public Guid UserId { get; set; } //This is the guid of the user who left the review
    }
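In all other cases the function computes the correct similarity between subjects. But when I use the test data above (where I expect perfect similarity), the result is NaN.
Is this a bug in my code, or a flaw in the adjusted cosine similarity? If the result is NaN, would it be fine to catch it and substitute a similarity of 1?
Edit: I also tried other matrices and got more interesting results.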
             | User 1 | User 2 | User 3 | User 4 | User 5
Restaurant 1 |   1    |   2    |   1    |   1    |   2
Restaurant 2 |   1    |   2    |   1    |   1    |   2
The result is still NaN.
             | User 1 | User 2 | User 3 | User 4 | User 5
Restaurant 1 |   2    |   2    |   1    |   1    |   2
Restaurant 2 |   1    |   2    |   1    |   1    |   2
This results in a similarity of -1.
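It looks like your algorithm is implemented correctly. The formula really can be undefined for perfectly reasonable inputs. You can read such a case as "this measure (adjusted cosine similarity) has nothing to say about the provided sets", so assigning an arbitrary value (0, 1, -1) is not correct. Instead, use a different measure in that case. For example, plain (unadjusted) cosine similarity gives 1 here, which is exactly what you expect.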
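To see where the NaN comes from, assume the review collection holds only the ratings shown in the matrices. In the first two matrices every user gave both restaurants the same rating, so each rating equals that user's mean, every deviation is 0, and the formula evaluates to 0/0. In the third matrix only User 1 contributes (mean 1.5, deviations +0.5 and -0.5), which gives -0.25 / (0.5 * 0.5) = -1. A minimal sketch of one way to handle the undefined case is below: it returns null instead of NaN so the caller can decide to fall back on plain cosine similarity. The names `SimilaritySketch`, `AdjustedCosine`, `PlainCosine`, `CoRated` and `Cosine` are hypothetical, and the `Review` class is repeated from the question only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated from the question so this sketch is self-contained.
public class Review
{
    public Guid Id { get; set; }
    public int rating { get; set; }
    public Guid SubjectId { get; set; }
    public Guid UserId { get; set; }
}

// Hypothetical helper class, not part of the original code.
public static class SimilaritySketch
{
    // Pairs each review of s1 with the review of s2 left by the same user.
    static IEnumerable<(Review R1, Review R2)> CoRated(Guid s1, Guid s2, IEnumerable<Review> all) =>
        from r1 in all.Where(r => r.SubjectId == s1)
        join r2 in all.Where(r => r.SubjectId == s2) on r1.UserId equals r2.UserId
        select (R1: r1, R2: r2);

    // Cosine of the two value columns; null when the denominator is zero,
    // i.e. when the measure is undefined for this input.
    static double? Cosine(IEnumerable<(double A, double B)> pairs)
    {
        double num = 0, den1 = 0, den2 = 0;
        foreach (var (a, b) in pairs)
        {
            num += a * b;
            den1 += a * a;
            den2 += b * b;
        }
        double den = Math.Sqrt(den1) * Math.Sqrt(den2);
        return den == 0 ? (double?)null : num / den;
    }

    // Adjusted cosine: subtract each user's mean rating before comparing.
    public static double? AdjustedCosine(Guid s1, Guid s2, IEnumerable<Review> all)
    {
        // Mean of everything each user has rated, computed once per user.
        var mean = all.GroupBy(r => r.UserId)
                      .ToDictionary(g => g.Key, g => g.Average(r => r.rating));
        return Cosine(CoRated(s1, s2, all)
            .Select(p => (p.R1.rating - mean[p.R1.UserId], p.R2.rating - mean[p.R1.UserId])));
    }

    // Plain (unadjusted) cosine over the raw ratings of co-rating users.
    public static double? PlainCosine(Guid s1, Guid s2, IEnumerable<Review> all) =>
        Cosine(CoRated(s1, s2, all)
            .Select(p => ((double)p.R1.rating, (double)p.R2.rating)));
}
```

A caller could then write `AdjustedCosine(a, b, reviews) ?? PlainCosine(a, b, reviews)` and skip the pair if that is still null (for example, when no user has rated both subjects). Whether plain cosine is an acceptable substitute depends on how the similarities are used afterwards, since the two measures are not on the same footing.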