Gensim Word2Vec most similar different result python

I have the first Harry Potter book in txt format. From it, I created two new txt files: in the first, every occurrence of Hermione is replaced with Hermione_1; in the second, every occurrence of Hermione is replaced with Hermione_2. I then concatenated the two texts into one long text and used it as the input to Word2Vec. Here is my code:

import os
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

with open("HarryPotter1.txt", 'r') as original, \
        open("HarryPotter1_1.txt", 'w') as mod1, \
        open("HarryPotter1_2.txt", 'w') as mod2:

    data=original.read()
    data_1 = data.replace("Hermione", 'Hermione_1')
    data_2 = data.replace("Hermione", 'Hermione_2')
    mod1.write(data_1 + "\n")  # "\n", not r"\n": the raw string would write a literal backslash-n
    mod2.write(data_2 + "\n")

with open("longText.txt",'w') as longFile:
    with open("HarryPotter1_1.txt",'r') as textfile:
        for line in textfile:
            longFile.write(line)
    with open("HarryPotter1_2.txt",'r') as textfile:
        for line in textfile:
            longFile.write(line)


model = None
word_vectors = None
modelName = "ModelTest"
vectorName = "WordVectorsTestst"

answer2 = input("Overwrite embedding? (yes or n)")  # raw_input() in Python 2
if(answer2 == 'yes'):
    with open("longText.txt",'r') as longFile:
        sentences = []
        for line in longFile:
            # reset per line; reusing one list would make every entry of
            # `sentences` reference the same ever-growing list
            single = []
            for word in line.split(" "):
                single.append(word)
            sentences.append(single)

    model = Word2Vec(sentences,workers=4, window=5,min_count=5)

    model.save(modelName)
    model.wv.save_word2vec_format(vectorName+".bin",binary=True)
    model.wv.save_word2vec_format(vectorName+".txt", binary=False)
    model.wv.save(vectorName)

    word_vectors = model.wv

else:
    model = Word2Vec.load(modelName)
    word_vectors = KeyedVectors.load_word2vec_format(vectorName + ".bin", binary=True)

    print(model.wv.similarity("Hermione_1","Hermione_2"))
    print(model.wv.distance("Hermione_1","Hermione_2"))
    print(model.wv.most_similar("Hermione_1"))
    print(model.wv.most_similar("Hermione_2"))

How is it possible that model.wv.most_similar("Hermione_1") and model.wv.most_similar("Hermione_2") give me different outputs? Their neighbors are completely different. This is the output of the four prints:

0.00799602753634
0.992003972464
[('moments,', 0.3204237222671509), ('rose;', 0.3189219534397125), ('Peering', 0.3185565173625946), ('Express,', 0.31800806522369385), ('no...', 0.31678506731987), ('pushing', 0.3131707012653351), ('triumph,', 0.3116190731525421), ('no', 0.29974159598350525), ('them?"', 0.2927379012107849), ('first.', 0.29270970821380615)]
[('go?', 0.45812922716140747), ('magical', 0.35565727949142456), ('Spells."', 0.3554503619670868), ('Scabbets', 0.34701400995254517), ('cupboard."', 0.33982667326927185), ('dreadlocks', 0.3325180113315582), ('sickening', 0.32789379358291626), ('First,', 0.3245708644390106), ('met', 0.3223033547401428), ('built', 0.3218075931072235)]

Training a Word2Vec model is stochastic to some extent, which is why you may get different results. Moreover, Hermione_2 only starts appearing in the second half of your text data. As I understand the training process, by the time you introduce the second word, the contexts of Hermione_1 (and hence its vector) are already established, and the algorithm tries to work out how the two tokens differ even though they occur in exactly the same contexts. Second, you are using rather short vectors, which may under-represent the complexity of the concept space; because of that compression you end up with two vectors whose neighborhoods share no overlap.