How to filter a model with respect to text and then use most_similar?

Question

I have a text, and I want to filter the model with respect to that text. Is this possible?

import pandas as pd
import gensim
import nltk
from nltk import word_tokenize
from nltk.collocations import *
from nltk.stem.wordnet import WordNetLemmatizer
import re

text = "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
from gensim.models import Word2Vec,  KeyedVectors

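# The GoogleNews vectors are stored in the binary word2vec format, so binary=True is needed
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)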
model_filter = [w for w in list(model.wv.vocab) if w in text]

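If it is possible, how can I use the most-similar function on the result (model_filter), something like modelo_filtrado.most_similar_cosmul????, so that it only considers the words that belong to the text? Thanks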

Answer 1

Your text is a plain string. The words in model are individual word strings. So your existing check tests whether each individual word appears anywhere in text as a substring.

For example, even though 'ice' does not appear in your text as a word, this evaluates to True:

'ice' in "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"

You probably want to turn text into a list of words first:

text_words = text.split()
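
To make the difference between the two checks concrete, here is a minimal sketch. The list toy_vocab is a made-up stand-in for the model's vocabulary (it is not taken from the GoogleNews vectors), chosen so that some entries only match as substrings:

```python
text = "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"

# Hypothetical stand-in for the model's vocabulary (illustration only)
toy_vocab = ["ice", "nice", "room", "sum", "summer", "lid", "winter", "hotel"]

# A set of whole words gives exact-word membership tests
text_words = set(text.split())

# Substring check against the raw string: also matches fragments of longer words
substring_matches = [w for w in toy_vocab if w in text]

# Whole-word check against the split text
word_matches = [w for w in toy_vocab if w in text_words]

# Entries that only matched because they are fragments of other words
false_positives = [w for w in substring_matches if w not in text_words]

print(substring_matches)
print(word_matches)
print(false_positives)
```

Here 'ice', 'sum' and 'lid' pass the substring check only because they occur inside 'choice'/'nice', 'summer' and 'solid'/'holiday'.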

Otherwise, yes, your code will leave you with only the words that are both in model and in your text (or text_words).
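
As for ranking similarity only among the words of the text, one possible approach is to score the filtered words against each other yourself. The sketch below is only an illustration under stated assumptions: toy_vectors is a hypothetical three-dimensional stand-in for the real model, and cosine and most_similar_in_text are helper names invented here, not part of gensim.

```python
import math

text = "though quite simple room solid choice allocated room already used summer holiday apartment bel endroit nice place place winter"
text_words = set(text.split())

# Hypothetical toy vectors standing in for the real word vectors (illustration only)
toy_vectors = {
    "room": [0.9, 0.1, 0.0],
    "apartment": [0.8, 0.2, 0.1],
    "summer": [0.1, 0.9, 0.2],
    "winter": [0.2, 0.8, 0.3],
    "hotel": [0.85, 0.15, 0.05],  # in the vocabulary, but not in the text
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Keep only vocabulary words that occur in the text as whole words
model_filter = [w for w in toy_vectors if w in text_words]

def most_similar_in_text(word, candidates, vectors, topn=3):
    """Rank the candidate words by cosine similarity to `word`, highest first."""
    scored = [(other, cosine(vectors[word], vectors[other]))
              for other in candidates if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

ranked = most_similar_in_text("room", model_filter, toy_vectors)
print(ranked)
```

With the real vectors, the cosine helper could probably be replaced by the model's own pairwise similarity method (similarity(w1, w2) on KeyedVectors, as far as I know), but check that against your installed gensim version.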
