Export Cosine Similarity Array as a Matrix with Labels

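Short version: I have an array and need to create a matrix with name labels along the top and the side, then export it like the example CSV. (Sorry if the wording is incorrect.)

Long version: I taught myself to build a recommender system and, after a year of quarantine learning and troubleshooting, have a website ready. Usually a few days of searching is enough for me to figure things out, but this one has had me stuck for about 3 weeks.

The recommender system works in Python: I can feed in a name and it spits out recommended names, and after tweaking it I get acceptable results. But none of the books, websites, tutorials, or Udemy classes I have used ever teach how to take the Python and make a Django site to get it working.

The current output looks like this: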

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# creating a Series for the name of the character so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]

0             ZZ Top
1         Zyan Malik
2    Zooey Deschanel
3       Ziggy Marley
4                ZHU
Name: name, dtype: object

# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.11708208, 0.10192614, ..., 0.        , 0.        ,
       0.        ],
      [0.11708208, 1.        , 0.1682581 , ..., 0.        , 0.        ,
       0.        ],
      [0.10192614, 0.1682581 , 1.        , ..., 0.        , 0.        ,
       0.        ],
      ...,
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ],
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ],
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ]])

# I need to then export to csv which I understand

.to_csv('artist_similarities.csv')

Desired export

I am trying to get what I believe is called a matrix: the array with the index names as labels, like this example.

   scores           ZZ Top       Zyan Malik   Zooey Deschanel  ZHU
0  ZZ Top           0            65.61249881  24.04163056      24.06241883
1  Zyan Malik       65.61249881  0            89.35882721      69.6634768
2  Zooey Deschanel  24.04163056  89.40917179  0                20.09975124
3  ZHU              7.874007874  69.6634768   20.09975124      0

# function that takes in the character name as input and returns the top 10 recommended characters
def recommendations(name, cosine_sim = cosine_sim):
    
    recommended_names = []
    
    # getting the positional index of the artist that matches the name
    idx = indices[indices == name].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar artists, skipping the first
    # entry, which is normally the artist itself
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the names of the best 10 matching artists
    for i in top_10_indexes:
        recommended_names.append(df.index[i])
        
    return recommended_names

# working results which for dataset are pretty good 

recommendations('Blues Traveler')

['G-Love & The Special Sauce',
 'Phish',
 'Spin Doctors',
 'Grace Potter and the Nocturnals',
 'Jason Mraz',
 'Pearl Jam',
 'Dave Matthews Band',
 'Lukas Nelson & Promise of the Real ',
 'Vonda Shepard',
 'Goo Goo Dolls']

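I am not sure I understand your question, and I cannot comment, so I have to write this here. I assume you want to add column and index labels to the cosine_sim array. You can do that like this: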

cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
cos_sim_df.to_csv("artist_similarities.csv")
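
To see what that produces without needing the real data, here is a minimal sketch. The three-name list and the small hard-coded array are stand-ins I made up in place of indices and cosine_sim, and the CSV is written to an in-memory buffer rather than a file so the text can be inspected.

```python
import io

import numpy as np
import pandas as pd

# Stand-in data (assumed for illustration, not the asker's dataset).
names = ["ZZ Top", "Zyan Malik", "ZHU"]
toy_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])

# Label both axes with the names.
toy_df = pd.DataFrame(toy_sim, index=names, columns=names)

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the CSV holds the column labels (with an empty first cell for the index column), and each following line starts with its row label.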

Then read the CSV back like this:

cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)

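This makes sure pandas knows that the first row and the first column hold the field names. I am also assuming your column and row labels are the same; you can change them as needed. One more thing: this will not look exactly like the desired export, because that CSV has a "score" field containing the artist names, even though it seems the artists should be the field names. If you want the exported CSV to look exactly like the desired export, you can add the artists in a "score" field like this: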

cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
cos_sim_df["score"] = indices
# make the score field the first field
cos_sim_df = cos_sim_df[["score", *indices]]
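
A self-contained sketch of that layout, again with made-up stand-in data. It uses DataFrame.insert to place the label column first, which is an alternative to reordering the columns afterwards, and names the column "scores" to follow the header in the desired export.

```python
import io

import numpy as np
import pandas as pd

# Stand-in data (assumed for illustration, not the asker's dataset).
names = ["ZZ Top", "Zyan Malik", "ZHU"]
toy_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])

# Keep the default 0..n-1 row index and label only the columns.
layout_df = pd.DataFrame(toy_sim, columns=names)

# Put the artist names in a "scores" column at position 0.
layout_df.insert(0, "scores", names)

buffer = io.StringIO()
layout_df.to_csv(buffer)
layout_csv = buffer.getvalue()
print(layout_csv)
```

The numeric row index is written as the first, unnamed CSV column, followed by the "scores" column and then one column per artist.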

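Finally, I want to point out that indexing a DataFrame with square brackets, as in cos_sim_df["Zyan Malik"], selects a column, while cos_sim_df.loc["Zyan Malik"] selects a row. In this specific case your array is symmetric about its diagonal, so it does not matter which axis you index; both return the same values. Keep the difference in mind if your array is not symmetric.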
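
To illustrate the row versus column point, here is a small sketch with a deliberately asymmetric array. The labels "A", "B", "C" and the values are hypothetical and chosen only so the two lookups differ.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and a deliberately asymmetric matrix.
names = ["A", "B", "C"]
asym = np.array([
    [0.0, 1.0, 2.0],
    [3.0, 0.0, 4.0],
    [5.0, 6.0, 0.0],
])
asym_df = pd.DataFrame(asym, index=names, columns=names)

column_b = asym_df["B"].tolist()   # the column labelled "B"
row_b = asym_df.loc["B"].tolist()  # the row labelled "B"

print(column_b)
print(row_b)
```

With a symmetric matrix such as a cosine similarity matrix the two lists would be identical; here they are not.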