Why does the Word2Vec function return a lot of 0.99 values?

Question

I am trying to apply a word2vec model to a dataset of reviews. First, I preprocess the dataset:

df=df.text.apply(gensim.utils.simple_preprocess)

This is the dataset I get:

0       [understand, location, low, score, look, mcdon...
3       [listen, it, morning, tired, maybe, hangry, ma...
6       [super, cool, bathroom, door, open, foot, nugg...
19      [cant, find, better, mcdonalds, know, getting,...
27      [night, went, mcdonalds, best, mcdonalds, expe...
                              ...
1677    [mcdonalds, app, order, arrived, line, drive, ...
1693    [correct, order, filled, promptly, expecting, ...
1694    [wow, fantastic, eatery, high, quality, ive, e...
1704    [let, tell, eat, lot, mcchickens, best, ive, m...
1716    [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object

Now I create the Word2Vec model and train it:

model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))

What I don't understand is why most_similar() returns so many similarities of 0.99.

[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]

What am I doing wrong?

Answer 1

According to the official documentation:

Find the top-N most similar words. ... 
This method computes cosine similarity between a simple mean of the projection weight 
vectors of the given words and the vectors for each word in the model. The method 
corresponds to the word-analogy and distance scripts in the original word2vec 
implementation. ...

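Since you pass df as the sentences parameter, gensim simply computes analogies and distances between the words in the different sentences (the DataFrame rows). I am not sure whether your DataFrame contains "service"; if it does, the resulting words are just the ones whose vectors are closest to that of "service" across those sentences.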
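
For intuition about what the reported numbers mean, here is a minimal sketch of cosine similarity in plain Python. The vectors are made-up toy values, not taken from the model above; they only illustrate that vectors dominated by a shared component score very close to 1.0, whatever the words mean.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: a large shared part plus tiny individual offsets.
shared = [1.0, 1.0, 1.0, 1.0]
word_a = [s + d for s, d in zip(shared, [0.01, -0.01, 0.0, 0.0])]
word_b = [s + d for s, d in zip(shared, [0.0, 0.0, 0.01, -0.01])]

print(cosine_similarity(word_a, word_b))          # very close to 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors give 0.0
```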

Answer 2

You're right, this isn't normal.

It is unlikely that your df is in the format Word2Vec expects. It needs a re-iterable Python sequence in which each item is a list of string tokens.

Try displaying next(iter(df)) to see the first item in df as Word2Vec would see it when iterating. Does it look like good training data?
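
A rough way to automate that check is sketched below. The helper name looks_like_word2vec_corpus and the sample data are hypothetical, made up for illustration; it only tests the two properties named above (re-iterable, each item a list of string tokens) and says nothing about whether the data is large enough.

```python
def looks_like_word2vec_corpus(corpus):
    """True if corpus survives two full passes and every item is a list of str tokens."""
    first_pass = list(corpus)
    second_pass = list(corpus)  # a one-shot generator yields nothing the second time
    if not first_pass or first_pass != second_pass:
        return False
    return all(
        isinstance(item, list) and all(isinstance(tok, str) for tok in item)
        for item in first_pass
    )

# Hypothetical sample data in three shapes.
good = [["correct", "order", "filled"], ["wow", "fantastic", "eatery"]]
raw_strings = ["correct order filled", "wow fantastic eatery"]  # not tokenized
one_shot = (tokens for tokens in good)  # generator: exhausted after one pass

print(next(iter(good)))                         # first item, as the model would see it
print(looks_like_word2vec_corpus(good))         # True
print(looks_like_word2vec_corpus(raw_strings))  # False
print(looks_like_word2vec_corpus(one_shot))     # False
```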

Regarding your code:

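min_count=1 is almost always a bad idea with Word2Vec: rare words cannot get good vectors, yet in aggregate they act like random noise that makes nearby words harder to train. In general, you should not lower the default min_count=5 unless you are sure it helps your results, because you have compared its effect against lower values. If your vocabulary seems to vanish because words do not appear even a paltry 5 times, your data is probably too small for this data-hungry algorithm.
Only 283 texts are unlikely to be enough training data unless each text has tens of thousands of tokens. (Even if some results can be squeezed out of this far-smaller-than-ideal corpus, you may need to shrink vector_size and/or increase epochs to make the most of minimal data.)
If you supply a corpus as sentences to the Word2Vec() constructor, you do not need to call .train(). The constructor will already have used that corpus in full, automatically. (You only need to call the separate .build_vocab() and .train() steps yourself if you did not supply a corpus at construction time.)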
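
To see how much vocabulary would survive a given min_count before training anything, a quick frequency count is enough. This is a minimal sketch in plain Python: the helper vocab_survival and the toy corpus are hypothetical, and it assumes a word is kept when its total count is at least min_count.

```python
from collections import Counter

def vocab_survival(corpus, min_count):
    """Return (words kept, total distinct words) for a tokenized corpus."""
    counts = Counter(tok for text in corpus for tok in text)
    kept = sum(1 for c in counts.values() if c >= min_count)
    return kept, len(counts)

# Hypothetical toy corpus; substitute your own list of token lists.
toy_corpus = [
    ["service", "fast", "food", "good"],
    ["service", "slow", "food", "cold"],
    ["service", "friendly", "food", "hot"],
]

for mc in (1, 2, 5):
    kept, total = vocab_survival(toy_corpus, mc)
    print(f"min_count={mc}: {kept} of {total} words kept")
```

If almost nothing is kept at min_count=5, that points to a corpus that is too small rather than a threshold that is too high.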

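I strongly recommend enabling logging at the INFO level or higher for the relevant classes (either all of Gensim or just Word2Vec). You will then see useful logging and progress information which, read carefully, tends to reveal problems such as the redundant second training here. (That redundant training is not the cause of your main problem, though.)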
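
A minimal sketch of both suggestions, assuming gensim 4.x (where the parameters are named vector_size and epochs). The function name train_once and the values vector_size=50 and epochs=20 are illustrative assumptions, not tuned recommendations.

```python
import logging

# Show INFO-level messages from all loggers, including Gensim's progress output.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # Python 3.8+: replace any handlers configured earlier
)

def train_once(sentences, vector_size=50, epochs=20, min_count=5):
    """Train through the constructor only; no separate .train() call."""
    from gensim.models import Word2Vec  # imported here so logging is set up first
    return Word2Vec(
        sentences=sentences,      # re-iterable sequence of token lists
        vector_size=vector_size,  # smaller than the question's 200 (assumed value)
        window=10,
        min_count=min_count,      # library default, as recommended above
        epochs=epochs,            # more passes over a small corpus (assumed value)
        workers=4,
    )
```

Calling train_once(df) would then replace both the constructor and the extra model.train(...) line in the question, and the log output should show a single training run.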

python

gensim

word2vec