(TF-IDF) How to return the five most related articles after calculating cosine similarity
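I have a DataFrame sample_df with 4 columns: paper_id, title, abstract, body_text. I extracted the abstract column (each abstract is about 1000 words) and applied a text-cleaning step. Here is my question:
After calculating the cosine similarity between the question and the abstracts, how do I return the information (such as paper_id, title, body_text) for the articles with the top 5 scores? My goal is to do TF-IDF question answering.
Sorry for my poor English; I am new to NLP. I would appreciate it if someone could help.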
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']
tfidf_vector = TfidfVectorizer()
tfidf = tfidf_vector.fit_transform(txt_cleaned)
tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()
related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities[related_docs_indices]
#output([0.18986527, 0.18339485, 0.14951123, 0.13441914])
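First: if you want 5 articles, you have to use [:-6:-1] instead of [:-5:-1], because the stop index is exclusive, so a slice that runs backwards from the end and stops at -5 returns only 4 items (which is why your output shows 4 scores).
Or use [::-1][:5]: the [::-1] reverses all the values, and then you can use the normal [:5].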
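The off-by-one in the negative slice is easy to check on a small array. This is only a minimal sketch; the values in scores are made up for illustration and are not from the question.

```python
import numpy as np

# Illustrative similarity scores (made up for this sketch).
scores = np.array([0.10, 0.50, 0.30, 0.90, 0.20, 0.70, 0.40])

order = scores.argsort()      # positions sorted by ascending score

four = order[:-5:-1]          # walks back from the end, stops before -5: 4 items
five = order[:-6:-1]          # stops before -6: 5 items
five_alt = order[::-1][:5]    # reverse everything, then take the first 5

print(len(four), len(five), len(five_alt))
print(five, five_alt)
```

Both five and five_alt hold the positions of the five highest scores, best first; the second spelling is usually easier to read.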
Once you have related_docs_indices, you can use .iloc[] to get the matching rows from the DataFrame:
sample_df.iloc[ related_docs_indices ]
If some elements have the same similarity, this version gives them in reverse order (see the result at the end).
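To return only the columns asked about together with their scores, the positional lookup can be combined with a column selection. A minimal sketch, assuming a tiny stand-in DataFrame df and made-up scores (neither comes from the question):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for sample_df (values are made up for this sketch).
df = pd.DataFrame({
    'paper_id': [11, 12, 13],
    'title': ['A', 'B', 'C'],
    'body_text': ['text a', 'text b', 'text c'],
})
scores = np.array([0.2, 0.9, 0.5])   # illustrative similarities

top = scores.argsort()[::-1][:2]     # positions of the 2 best scores

# .iloc is positional, so it matches the positions returned by argsort
# even when the DataFrame index is not 0..n-1.
result = df.iloc[top][['paper_id', 'title', 'body_text']].copy()
result['similarity'] = scores[top]   # attach the matching scores
print(result)
```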
BTW: you can also add the similarities to the DataFrame
sample_df['similarity'] = cosine_similarities
and then sort in descending order and take the first 5 rows.
sample_df.sort_values('similarity', ascending=False)[:5]
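If some elements have the same similarity, this version gives them in their original order (see the result at the end). The default sort algorithm is not guaranteed to be stable, so pass kind='mergesort' to sort_values if you rely on that order.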
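A short sketch of the tie behaviour, using a made-up DataFrame in which rows 0 and 3 have the same score. It also shows DataFrame.nlargest, which is an alternative I am suggesting here and not something used in the answer's own code.

```python
import pandas as pd

# Made-up scores for this sketch: rows 0 and 3 tie for first place.
df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4],
    'similarity': [0.6, 0.0, 0.3, 0.6],
})

# A stable sort keeps tied rows in their original order.
by_sort = df.sort_values('similarity', ascending=False, kind='mergesort').head(2)

# nlargest picks the same rows; with the default keep='first',
# ties are resolved in favour of the earlier row.
by_nlargest = df.nlargest(2, 'similarity')

print(by_sort)
print(by_nlargest)
```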
Minimal working code with some data, so everyone can copy and test it.
Because I have only 5 elements in the DataFrame, I search for 2 elements.
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
})

def get_cleaned_text(df, row):
    # placeholder for the real cleaning step: returns the column unchanged
    return row

txt_cleaned = get_cleaned_text(sample_df, sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()
tfidf = tfidf_vector.fit_transform(txt_cleaned)      # fit on the abstracts
tfidf_question = tfidf_vector.transform(question)    # reuse the same vocabulary

# TF-IDF rows are L2-normalized by default, so this dot product is the cosine similarity
cosine_similarities = linear_kernel(tfidf_question, tfidf).flatten()
sample_df['similarity'] = cosine_similarities

number = 2
#related_docs_indices = cosine_similarities.argsort()[:-(number+1):-1]
related_docs_indices = cosine_similarities.argsort()[::-1][:number]

print('index:', related_docs_indices)
print('similarity:', cosine_similarities[related_docs_indices])

print('\n--- related_docs_indices ---\n')
print(sample_df.iloc[related_docs_indices])

print('\n--- sort_values ---\n')
print(sample_df.sort_values('similarity', ascending=False)[:number])
Result:
index: [3 0]
similarity: [0.62791376 0.62791376]

--- related_docs_indices ---

   paper_id          title abstract            body_text  similarity
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
0         1        Covid19  covid19        Hello covid19    0.627914

--- sort_values ---

   paper_id          title abstract            body_text  similarity
0         1        Covid19  covid19        Hello covid19    0.627914
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
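If this lookup is needed for several questions, the steps can be wrapped in one function. The sketch below is only one possible arrangement: top_k_articles and its parameters are hypothetical names introduced here, the papers DataFrame repeats the toy data, and nlargest is my choice for the final selection.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_articles(df, question, k=5, text_column='abstract'):
    """Return the k rows of df whose text_column best matches question.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(df[text_column])
    question_vec = vectorizer.transform([question])
    # TF-IDF rows are L2-normalized by default, so the dot product
    # computed by linear_kernel equals the cosine similarity.
    scores = linear_kernel(question_vec, doc_matrix).flatten()
    # assign() returns a copy, so the caller's DataFrame is left unchanged.
    return df.assign(similarity=scores).nlargest(k, 'similarity')


papers = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer',
                  'Hello covid19 again', 'Buy new air conditioner'],
})

top = top_k_articles(papers, 'Can covid19 transmit through air', k=2)
print(top[['paper_id', 'title', 'body_text', 'similarity']])
```

The vectorizer is refitted on every call, which is fine for a small example; for many questions against the same abstracts it would make more sense to fit once and reuse the matrix.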