Extract top N words that are most similar to an input word from a text file
I have a text file that contains the content of a web page that I extracted using BeautifulSoup. I need to find the N most similar words from the text file for a given word. The process is as follows:
- Website the text is extracted from: https://en.wikipedia.org/wiki/Football
- The extracted text is saved to a text file.
- The user enters a word, e.g. "goal", and I have to display the top N most similar words from the text file.
I have only worked in computer vision and am completely new to NLP. I am currently stuck at step 3. I have tried spaCy and Gensim, but my approach is not efficient at all. I currently do this:
for word in ['goal', 'soccer']:
    # 1. compute similarity using spacy for each word in the text file with the given word.
    # 2. sort them based on the scores and choose the top N words.
Is there any other approach or simpler solution to this problem? Any help is appreciated. Thanks!
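For reference, steps 1–2 described above can be sketched as follows (assuming beautifulsoup4 is installed; an inline HTML snippet stands in for the Wikipedia response here, and in practice you would pass the page fetched with requests.get("https://en.wikipedia.org/wiki/Football").text instead):

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded Wikipedia page.
html = "<html><body><p>Football is a family of team sports.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Join the visible paragraph text and save it for the similarity step.
text = "\n".join(p.get_text() for p in soup.find_all("p"))
with open("football.txt", "w", encoding="utf-8") as f:
    f.write(text)
```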
You can use spaCy's similarity
method, which computes the cosine similarity between tokens for you. To use vectors, load a model that ships with vectors:
import spacy

nlp = spacy.load("en_core_web_md")
text = "I have a text file that contains the content of a web page that I have extracted using BeautifulSoup. I need to find N similar words from the text file based on a given word. The process is as follows"
doc = nlp(text)
words = ['goal', 'soccer']

# compute the similarity of each token in the document with the query word
similarities = {}
for word in words:
    tok = nlp(word)
    similarities[tok.text] = {}
    for tok_ in doc:
        similarities[tok.text].update({tok_.text: tok.similarity(tok_)})

# sort by score and keep the top 10
top10 = lambda x: {k: v for k, v in sorted(similarities[x].items(), key=lambda item: item[1], reverse=True)[:10]}
# desired output
top10("goal")
{'need': 0.41729581641359625,
'that': 0.4156277030017712,
'to': 0.40102258054859163,
'is': 0.3742535591719576,
'the': 0.3735002888862756,
'The': 0.3735002888862756,
'given': 0.3595024941701789,
'process': 0.35218102758578645,
'have': 0.34597281472837316,
'as': 0.34433650293640194}
Note that if (1) you are comfortable with gensim
and/or (2) have a word2vec
model trained on your text, you can directly do:
word2Vec.most_similar(positive=['goal'], topn=10)