Finding similarities between houses in a pandas DataFrame for content filtering

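I want to apply content-based filtering to houses. For each house, I want a similarity score against every other house, so that I can answer questions such as: what can I recommend to someone looking at house1? So I need a similarity matrix for the houses. How can I compute it?

Thanks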

    import pandas as pd

    # Columns: house, size, price, heating_type, room_count
    data = [['house1', 100, 1500, 'gas', '3+1'],
            ['house2', 120, 2000, 'gas', '2+1'],
            ['house3', 40, 1600, 'electricity', '1+1'],
            ['house4', 110, 1450, 'electricity', '2+1'],
            ['house5', 140, 1200, 'electricity', '2+1'],
            ['house6', 90, 1000, 'gas', '3+1'],
            ['house7', 110, 1475, 'gas', '3+1']]

    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['house', 'size', 'price', 'heating_type', 'room_count'])

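If we define similarity for numeric values by their absolute difference, and for strings by the similarity ratio computed by SequenceMatcher (more precisely, 1 - ratio, so that it is comparable to a difference), we can apply these operations to the respective columns and then normalize the results to the range 0 ... 1, where 1 means (almost) equal and 0 means minimal similarity. Adding up the individual columns, the house most similar to the reference house is the one with the highest total similarity.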
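To make the string part of this measure concrete, here is a minimal sketch of the `1 - ratio` distance on a few values from the question. The helper name `string_distance` is only an illustration and is not part of the original answer.

```python
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Hypothetical helper: 1 - SequenceMatcher ratio (0.0 means identical strings)."""
    return 1 - SequenceMatcher(None, a, b).ratio()


d_same = string_distance('gas', 'gas')          # identical strings
d_heat = string_distance('gas', 'electricity')  # no characters in common
d_room = string_distance('3+1', '2+1')          # both end in '+1'

print(d_same, d_heat, d_room)
```

The distance is 0 for identical strings and 1 when the strings share no characters. Since the room counts differ only in their first character, the distance between them is about 0.33.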

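    from difflib import SequenceMatcher

    df = df.set_index('house')
    ref = 'house1'  # house to compare all others against

    # Numeric columns: absolute difference to the reference house
    res = pd.DataFrame(df[['size','price']].sub(df.loc[ref,['size','price']]).abs())
    # String columns: 1 - SequenceMatcher ratio, so that 0 means identical
    res['heating_type'] = df.heating_type.apply(lambda x: 1 - SequenceMatcher(None, df.loc[ref, 'heating_type'], x).ratio())
    res['room_count'] = df.room_count.apply(lambda x: 1 - SequenceMatcher(None, df.loc[ref, 'room_count'], x).ratio())
    # Total difference, then scale every column to 0 ... 1 (1 = most similar)
    res['total'] = res['size'] + res.price + res.heating_type + res.room_count
    res = 1 - res / res.max()

    print(res)
    # Exclude the reference house itself when picking the best match
    print('\nBest match of ' + ref + ' is ' + res['total'].drop(ref).idxmax())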

Result:

            size  price  heating_type  room_count     total
house                                                      
house1  1.000000   1.00           1.0         1.0  1.000000
house2  0.666667   0.00           1.0         0.0  0.000000
house3  0.000000   0.80           0.0         0.0  0.689942
house4  0.833333   0.90           0.0         0.0  0.882127
house5  0.333333   0.40           0.0         0.0  0.344010
house6  0.833333   0.00           1.0         1.0  0.019859
house7  0.833333   0.95           1.0         1.0  0.932735

Best match of house1 is house7
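The snippet above compares every house against house1 only, and it sums the raw differences before scaling, so the column with the largest numbers (price) dominates `total`. The question asks for a full similarity matrix. One possible way to get it is sketched below, assuming each column is first scaled to 0 ... 1 on its own and the four columns are then averaged with equal weight. The helper names `numeric_similarity` and `string_similarity` and the equal weighting are assumptions made for this sketch, not part of the original answer.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

data = [['house1', 100, 1500, 'gas', '3+1'],
        ['house2', 120, 2000, 'gas', '2+1'],
        ['house3', 40, 1600, 'electricity', '1+1'],
        ['house4', 110, 1450, 'electricity', '2+1'],
        ['house5', 140, 1200, 'electricity', '2+1'],
        ['house6', 90, 1000, 'gas', '3+1'],
        ['house7', 110, 1475, 'gas', '3+1']]
df = pd.DataFrame(data, columns=['house', 'size', 'price', 'heating_type',
                                 'room_count']).set_index('house')


def numeric_similarity(col: pd.Series) -> np.ndarray:
    """Pairwise 1 - |a - b| / (largest pairwise difference) for a numeric column."""
    values = col.to_numpy(dtype=float)
    diff = np.abs(values[:, None] - values[None, :])
    max_diff = diff.max()
    return 1 - diff / max_diff if max_diff > 0 else np.ones_like(diff)


def string_similarity(col: pd.Series) -> np.ndarray:
    """Pairwise SequenceMatcher ratio for a string column."""
    values = col.tolist()
    return np.array([[SequenceMatcher(None, a, b).ratio() for b in values]
                     for a in values])


# One (n x n) similarity matrix per column, each already in 0 ... 1
parts = [numeric_similarity(df['size']),
         numeric_similarity(df['price']),
         string_similarity(df['heating_type']),
         string_similarity(df['room_count'])]

# Assumption: every column gets the same weight
similarity = pd.DataFrame(np.mean(parts, axis=0), index=df.index, columns=df.index)

# Hide the diagonal so a house is never recommended for itself
masked = similarity.mask(np.eye(len(similarity), dtype=bool))
best_match = masked.idxmax(axis=1)

print(similarity.round(2))
print(best_match)
```

Row `house1` of `similarity` answers the original question directly. With equal weights the best match for house1 is still house7. If some attributes should matter more, replacing `np.mean` with a weighted average such as `np.average(parts, axis=0, weights=...)` would be one way to express that.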