我怎样才能匹配 python 中的所有键值对,其中 运行 太长了

how can I match all the key value pair in python which running too long

用户-项目亲和力和推荐:
我正在创建一个 table,它建议 "customers who bought this item also bought algorithm "
输入数据集

productId   userId
Prod1        a
Prod1        b
Prod1        c
Prod1        d
prod2        b
prod2        c
prod2        a
prod2        b
prod3        c
prod3        a
prod3        d
prod3        c
prod4        a
prod4        b
prod4        d
prod4        a
prod5        d
prod5        a

需要输出

Product1    Product2    score
Prod1       prod3
Prod1       prod4
Prod1       prod5
prod2       Prod1
prod2       prod3
prod2       prod4
prod2       prod5
prod3       Prod1
prod3       prod2
Using code : 
#Get list of unique items
itemList=list(set(main["productId"].tolist()))

#Get count of users
userCount=len(set(main["productId"].tolist()))

#Create an empty data frame to store item affinity scores for items.
itemAffinity= pd.DataFrame(columns=('item1', 'item2', 'score'))
rowCount=0

#For each item in the list, compare with other items.
for ind1 in range(len(itemList)):

    #Get list of users who bought this item 1.
    item1Users = main[main.productId==itemList[ind1]]["userId"].tolist()
    #print("Item 1 ", item1Users)

    #Get item 2 - items that are not item 1 or those that are not analyzed already.
    for ind2 in range(ind1, len(itemList)):

        if ( ind1 == ind2):
            continue

        #Get list of users who bought item 2
        item2Users=main[main.productId==itemList[ind2]]["userId"].tolist()
        #print("Item 2",item2Users)

        #Find score. Find the common list of users and divide it by the total users.
        commonUsers= len(set(item1Users).intersection(set(item2Users)))
        score=commonUsers / userCount

        #Add a score for item 1, item 2
        itemAffinity.loc[rowCount] = [itemList[ind1],itemList[ind2],score]
        rowCount +=1
        #Add a score for item2, item 1. The same score would apply irrespective of the sequence.
        itemAffinity.loc[rowCount] = [itemList[ind2],itemList[ind1],score]
        rowCount +=1

#Check final result
itemAffinity

代码在示例数据集上 运行 非常好,但是
在包含 100,000 行的数据集中,代码花费的时间太长 运行。请帮我优化代码。

这里的关键是创建productId的笛卡尔积。请参阅下面的代码,

方法 1(适用于较小的数据集)

result=(main.drop_duplicates(['productId','userId'])
            .assign(cartesian_key=1)
            .pipe(lambda x:x.merge(x,on='cartesian_key'))
            .drop('cartesian_key',axis=1)
            .loc[lambda x:(x.productId_x!=x.productId_y) & (x.userId_x==x.userId_y)]
            .groupby(['productId_x','productId_y']).size()
            .div(data['userId'].nunique()))

result

Prod1   prod2   0.75
Prod1   prod3   0.75
Prod1   prod4   0.75
Prod1   prod5   0.5
prod2   Prod1   0.75
prod2   prod3   0.5
prod2   prod4   0.5
prod2   prod5   0.25
prod3   Prod1   0.75
prod3   prod2   0.5
prod3   prod4   0.5
prod3   prod5   0.5
prod4   Prod1   0.75
prod4   prod2   0.5
prod4   prod3   0.5
prod4   prod5   0.5
prod5   Prod1   0.5
prod5   prod2   0.25
prod5   prod3   0.5
prod5   prod4   0.5

方法二

result = (df.groupby(['productId','userId']).size()
            .clip(upper=1)
            .unstack()
            .assign(key=1)
            .reset_index()
            .pipe(lambda x:x.merge(x,on='key'))
            .drop('key',axis=1)
            .loc[lambda x:(x.productId_x!=x.productId_y)]
            .set_index(['productId_x','productId_y'])
            .pipe(lambda x:x.set_axis(x.columns.str.split('_',expand=True),axis=1,inplace=False))
            .swaplevel(axis=1)
            .pipe(lambda x:(x['x']+x['y']))
            .fillna(0)
            .div(2) 
            .mean(axis=1))

是的,算法可以改进。您正在多次为内部循环中的项目重新计算用户列表。 您可以只在循环外获取项目及其用户的字典。

# get unique items
items = set(main.productId)

n_users = len(set(main.userId))

# make a dictionary of item and users who bought that item
item_users = main.groupby('productId')['userId'].apply(set).to_dict()

# iterate over combinations of item1 and item2 and store scores
result = []
for item1, item2 in itertools.combinations(items, 2):

  score = len(item_users[item1] & item_users[item2]) / n_users
  item_tuples = [(item1, item2), (item2, item1)]
  result.append((item1, item2, score))
  result.append((item2, item1, score)) # store score for reverse order as well

# convert results to a dataframe
result = pd.DataFrame(result, columns=["item1", "item2", "score"])

时间差异:

问题的原始实现

# 3 loops, best of 3: 41.8 ms per loop

马克的方法 2

# 3 loops, best of 3: 19.9 ms per loop

这个答案中的实现

# 3 loops, best of 3: 3.01 ms per loop