Export Cosine Similarity Array as a Matrix with Labels
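Short version: I have an array and need to turn it into a matrix with name labels along both the top and the side, then export it like the example csv. (Sorry if the wording is incorrect.)
Long version:
I taught myself to build a recommender system and, after a year of quarantine learning and troubleshooting, have a website ready. Usually a few days of searching is enough for me to figure things out, but this has had me stuck for about 3 weeks now.
The recommender system works in Python: I can feed in a name and it spits out recommended names, and after tweaking it I get acceptable results. But the books, websites, tutorials, Udemy classes, etc. never teach how to take the Python and make a Django site to get it to work.
The current output looks like this: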
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# creating a Series for the name of the character so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]
0 ZZ Top
1 Zyan Malik
2 Zooey Deschanel
3 Ziggy Marley
4 ZHU
Name: name, dtype: object
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
array([[1. , 0.11708208, 0.10192614, ..., 0. , 0. ,
0. ],
[0.11708208, 1. , 0.1682581 , ..., 0. , 0. ,
0. ],
[0.10192614, 0.1682581 , 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ]])
# I need to then export to csv which I understand
.to_csv('artist_similarities.csv')
Desired export
I am trying to get an array with index names, which I believe is called a matrix, like this example:
   scores           ZZ Top       Zyan Malik   Zooey Deschanel  ZHU
0  ZZ Top           0            65.61249881  24.04163056      24.06241883
1  Zyan Malik       65.61249881  0            89.35882721      69.6634768
2  Zooey Deschanel  24.04163056  89.40917179  0                20.09975124
3  ZHU              7.874007874  69.6634768   20.09975124      0
# function that takes in the character name as input and returns the top 10 recommended characters
def recommendations(name, cosine_sim = cosine_sim):
    recommended_names = []
    # getting the index of the row that matches the name
    idx = indices[indices == name].index[0]
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar characters (position 0 is the name itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the names of the best 10 matching characters
    for i in top_10_indexes:
        recommended_names.append(list(df.index)[i])
    return recommended_names
# working results which for dataset are pretty good
recommendations('Blues Traveler')
['G-Love & The Special Sauce',
'Phish',
'Spin Doctors',
'Grace Potter and the Nocturnals',
'Jason Mraz',
'Pearl Jam',
'Dave Matthews Band',
'Lukas Nelson & Promise of the Real ',
'Vonda Shepard',
'Goo Goo Dolls']
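I am not sure I understand your question, and I cannot comment, so I have to write it here. I assume you want to add column and index labels to the cosine_sim array. You can do that like this: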
cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
cos_sim_df.to_csv("artist_similarities.csv")
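To make the result concrete, here is a minimal sketch of the same idea on a tiny example. The 3×3 matrix and the three-name `indices` Series are hypothetical stand-ins for the real `cosine_sim` and `indices` from the question; calling `to_csv()` without a path returns the CSV text, which makes the layout easy to inspect.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `indices` and `cosine_sim` from the question.
indices = pd.Series(["ZZ Top", "Zyan Malik", "ZHU"])
cosine_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])

# Label both axes with the artist names.
cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = cos_sim_df.to_csv()
print(csv_text)

# Scores can now be looked up by name on either axis.
print(cos_sim_df.loc["ZZ Top", "Zyan Malik"])
```

The first line of the CSV holds the artist names as column headers, and each following line starts with the artist name for that row.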
Then read the csv back like this:
cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)
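The header=0 and index_col=0 arguments make sure pandas knows that the first row and first column are field names. I am also assuming your column and row indices are the same; you can change them as needed. One more thing: this will not look exactly like the desired export, because that csv has a "scores" field containing the artist names, even though it looks like the artists should be the field names. If you want the exported csv to look exactly like the desired export, you can add the artists in a "score" field like this: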
cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
cos_sim_df["score"] = indices
# make the score field the first field
# (unpack the artist names from indices; idx only exists inside recommendations())
cos_sim_df = cos_sim_df[["score", *indices]]
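As a rough sketch of this second layout, the snippet below again uses a hypothetical 3×3 matrix and artist list in place of the real `cosine_sim` and `indices`. It uses `DataFrame.insert` as one alternative way to put the label column first, and it names the column "scores" to match the header shown in the desired export (that name is an assumption). The CSV text is read back to check the layout.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for `indices` and `cosine_sim` from the question.
indices = pd.Series(["ZZ Top", "Zyan Malik", "ZHU"])
cosine_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])

# Artist names as column headers; rows keep the default 0..n-1 index.
cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)

# Put the artist names in a leading "scores" column (position 0).
cos_sim_df.insert(0, "scores", indices)

# The default numeric index is written as the first, unnamed CSV column.
csv_text = cos_sim_df.to_csv()
print(csv_text)

# Read it back, treating that first column as the index again.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(round_trip)
```

To write a file instead, pass a path such as "artist_similarities.csv" to `to_csv`.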
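Finally, keep in mind which axis you are indexing. You seem to picture the fields as column labels, and cos_sim_df["Zyan Malik"] does select a column, while cos_sim_df.loc["Zyan Malik"] selects a row. In this specific case your array is symmetric about the diagonal, so it does not matter which axis you index: both return the same values. If your array is not symmetric, keep this in mind.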
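The difference between the two axes only shows up when the matrix is not symmetric. The sketch below uses a deliberately asymmetric, made-up 3×3 matrix with hypothetical labels "A", "B" and "C" to show that selecting by column and selecting by row then return different values.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and a deliberately asymmetric matrix.
labels = ["A", "B", "C"]
asym = np.array([
    [0.0, 1.0, 2.0],
    [3.0, 0.0, 4.0],
    [5.0, 6.0, 0.0],
])
df = pd.DataFrame(asym, index=labels, columns=labels)

col_b = df["B"]      # selects the column labelled "B"
row_b = df.loc["B"]  # selects the row labelled "B"

print(col_b.tolist())
print(row_b.tolist())
print(col_b.equals(row_b))
```

With a symmetric matrix such as a cosine similarity matrix, the two selections would hold the same values.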