Efficient way to find objects with common attributes in 2 Python lists

Question

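The title may be confusing, so I will try to explain what I want to do: I am studying computer science and trying to implement a small movie recommender as a project for my lecture "Data-Warehousing & Data Mining". Right now I am trying to compute the similarity of 2 users based on their movie ratings.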

class Rating(Model):
    def __init__(self, userID, movieID, rating):
        ...

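I overrode __eq__, __ne__ and __hash__ on Rating so that they only take movieID into account. That way I can build a set from the ratings of 2 users to find the movies both of them have rated.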

def similarity(userA, userB):
    ratingsA = userA.ratings
    ratingsB = userB.ratings
    common_ratings = set((ratingsA, ratingsB))

What I want now is something like the following: 2 lists sorted in the same order, so that I can compute the cosine distance between the users' ratings.

[Rating(userID=1, movieID=4, rating=4.7), Rating(userID=1, movieID=7, rating=9.8)]
[Rating(userID=2, movieID=4, rating=2.0), Rating(userID=2, movieID=7, rating=6.6)]

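I really don't like my approach, but I haven't been able to find a better one in the past few hours.

Another, less efficient (I think?) way would be something like this: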

lA = []
lB = []
for rA in ratingsA:
    for rB in ratingsB: 
        if rA.movieID == rB.movieID:
            lA.append(rA)
            lB.append(rB)
sim = cos_dist(lA, lB)

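This approach would probably work, but I guess the runtime would be terrible, since there are about 40000 movies and the probability that 2 users rated the same movie is quite low...

Maybe someone knows an efficient approach? Thanks in advance!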

Answer 1

Your approach is O(N^2) in the worst case. You can reduce the complexity to O(N log N) by sorting the rating lists:

sorted_ratingsA = sorted(ratingsA, key=lambda x: x.movieID)
sorted_ratingsB = sorted(ratingsB, key=lambda x: x.movieID)

Now we can pop items from the end of these lists (for efficiency reasons) and use the ordering on movieID to check whether both users rated a given id. Roughly like this:

lA = []
lB = []
# Compare the largest remaining movieID of each list until one list is empty.
while sorted_ratingsA and sorted_ratingsB:
    maxA = sorted_ratingsA[-1]
    maxB = sorted_ratingsB[-1]
    if maxA.movieID == maxB.movieID:
        # Both users rated this movie: keep the pair and consume it from both lists.
        lA.append(maxA)
        lB.append(maxB)
        sorted_ratingsA.pop()
        sorted_ratingsB.pop()
    elif maxA.movieID < maxB.movieID:
        # B's largest id can no longer occur in A, so drop it.
        sorted_ratingsB.pop()
    else:
        # A's largest id can no longer occur in B, so drop it.
        sorted_ratingsA.pop()

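As you can see, you pop from whichever list currently holds the larger id until the two ids are equal or the id drops below the other list's, at which point you start popping from the other list. Because both lists are sorted, this pass finds all matches in O(N) steps; together with the sorting, the total is O(N log N).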

Using pop() is essential here, because popping from the end of a list takes amortized O(1) time, whereas pop(0) costs O(N) per call and would reintroduce the O(N^2) factor.
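The same idea can also be written without mutating the lists. The following is only a minimal sketch under a few assumptions: the `Rating` namedtuple is a stand-in for the model in the question, the function name `common_ratings_sorted` is made up for this example, and each user is assumed to rate a movie at most once. It walks both sorted lists from the front with two indices instead of popping from the end.

```python
from collections import namedtuple

# Hypothetical stand-in for the Rating model from the question.
Rating = namedtuple("Rating", ["userID", "movieID", "rating"])


def common_ratings_sorted(ratingsA, ratingsB):
    """Return two aligned lists with the ratings of movies both users rated."""
    a = sorted(ratingsA, key=lambda r: r.movieID)
    b = sorted(ratingsB, key=lambda r: r.movieID)
    lA, lB = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i].movieID == b[j].movieID:
            lA.append(a[i])
            lB.append(b[j])
            i += 1
            j += 1
        elif a[i].movieID < b[j].movieID:
            i += 1  # a[i] has no partner in b
        else:
            j += 1  # b[j] has no partner in a
    return lA, lB


ratingsA = [Rating(1, 7, 9.8), Rating(1, 4, 4.7), Rating(1, 12, 3.0)]
ratingsB = [Rating(2, 4, 2.0), Rating(2, 9, 5.5), Rating(2, 7, 6.6)]
lA, lB = common_ratings_sorted(ratingsA, ratingsB)
print([r.movieID for r in lA])  # [4, 7]
```

Since no element is ever removed, the pop() versus pop(0) question does not arise in this variant, and the result comes out in ascending movieID order.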

Another approach is simply to use hashing, which should give you O(N) time on average. You first create two mappings from movieID to rating and then intersect their keys:

mapA = {x.movieID: x for x in ratingsA}
mapB = {x.movieID: x for x in ratingsB}
common_keys = mapA.keys() & mapB.keys()

lA = [mapA[k] for k in common_keys]
lB = [mapB[k] for k in common_keys]

If you are using python<3.x, replace keys() with viewkeys().

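Note: even though this solution uses hashing, the order of lA and lB will match, because the iteration order of a set only changes when the set is modified, so the two iterations above retrieve corresponding ratings. The order of the ratings themselves, however, is not defined (you do not know in which order the movieIDs appear, but you do know that it matches between lA and lB).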
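To show how the aligned lists could feed a similarity computation, here is a rough sketch. The question's `cos_dist` is not shown, so `cosine_similarity` below is an assumption about what it might compute (cosine distance would be 1 minus this value). The `Rating` namedtuple and the function names are made up for this example, and sorting the shared keys is optional; it only makes the order reproducible.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for the Rating model from the question.
Rating = namedtuple("Rating", ["userID", "movieID", "rating"])


def common_ratings_hashed(ratingsA, ratingsB):
    """Align the ratings of two users on the movies both have rated."""
    mapA = {r.movieID: r for r in ratingsA}
    mapB = {r.movieID: r for r in ratingsB}
    # Optional: sort the shared keys so the result order is reproducible.
    common_keys = sorted(mapA.keys() & mapB.keys())
    return [mapA[k] for k in common_keys], [mapB[k] for k in common_keys]


def cosine_similarity(lA, lB):
    """Cosine similarity of two aligned rating lists (assumed definition)."""
    dot = sum(a.rating * b.rating for a, b in zip(lA, lB))
    normA = math.sqrt(sum(a.rating ** 2 for a in lA))
    normB = math.sqrt(sum(b.rating ** 2 for b in lB))
    if normA == 0 or normB == 0:
        return 0.0  # convention chosen here for "no overlap" or all-zero ratings
    return dot / (normA * normB)


ratingsA = [Rating(1, 4, 4.7), Rating(1, 7, 9.8), Rating(1, 12, 3.0)]
ratingsB = [Rating(2, 7, 6.6), Rating(2, 4, 2.0)]
lA, lB = common_ratings_hashed(ratingsA, ratingsB)
sim = cosine_similarity(lA, lB)
```

With only a handful of shared movies per pair of users, building the two dictionaries dominates the cost, so the extra sort over the common keys is usually negligible.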

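You did not mention SQL in your question, but if these objects live in an SQL database, it is probably best to let the database do the search for you. You presumably have a rankings table with several fields, and you want to do something like this: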

SELECT * FROM rankings
JOIN rankings AS rankings2
ON rankings.movieID = rankings2.movieID
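For a concrete picture of what such a self-join might return, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The column names (userID, movieID, rating) and the WHERE clause restricting the join to two specific users are assumptions added for this example; the real schema may differ.

```python
import sqlite3

# In-memory database with an assumed schema for the rankings table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (userID INTEGER, movieID INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?, ?)",
    [(1, 4, 4.7), (1, 7, 9.8), (1, 12, 3.0), (2, 4, 2.0), (2, 9, 5.5), (2, 7, 6.6)],
)

# Self-join on movieID, restricted to one pair of users.
rows = conn.execute(
    """
    SELECT a.movieID, a.rating, b.rating
    FROM rankings AS a
    JOIN rankings AS b ON a.movieID = b.movieID
    WHERE a.userID = ? AND b.userID = ?
    ORDER BY a.movieID
    """,
    (1, 2),
).fetchall()
print(rows)  # [(4, 4.7, 2.0), (7, 9.8, 6.6)]
```

Each row already pairs the two users' ratings for one movie, so no alignment step is needed on the Python side.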

Tags: python, performance, similarity, set