I want to get a list of semantically similar words from two embedded documents in Python

Question

I am working on text embeddings in Python, where I compute the similarity between documents using a Doc2Vec model. The code is as follows:

for doc_id in range(len(train_corpus)):
    # Infer a vector for this document from its list of words.
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every trained document vector by similarity to the inferred vector,
    # most similar first. (In gensim 4.x, model.docvecs is named model.dv.)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    # Show the most, second-most, median, and least similar documents.
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Now, how can I extract a set of semantically similar words from two of these embedded documents?

Please help me.

Answer 1

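Only some Doc2Vec modes also train word vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors with added skip-gram word-vectors). If you've used such a mode, then your model.wv property will contain word vectors.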

Calling model.wv.similarity(word1, word2) will give you the pairwise similarity of any 2 words.

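So, you could iterate over all the words in doc1, collect the similarity of each one to every word in doc2, and report the single highest similarity for each word.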
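
The loop described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: best_matches, TOY_VECTORS, and toy_similarity are hypothetical names made up here, and the toy similarity function stands in for a trained model so the snippet runs without gensim.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def best_matches(
    doc1_words: Iterable[str],
    doc2_words: Iterable[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, Tuple[str, float]]:
    """For each distinct word in doc1, return its most similar word in doc2."""
    doc2_vocab = sorted(set(doc2_words))  # de-duplicate; sorted for stable output
    matches = {}
    for w1 in dict.fromkeys(doc1_words):  # de-duplicate while keeping order
        scored = [(w2, similarity(w1, w2)) for w2 in doc2_vocab]
        if scored:
            # Keep only the single highest-scoring doc2 word for this doc1 word.
            matches[w1] = max(scored, key=lambda pair: pair[1])
    return matches


# Hypothetical 2-d "word vectors", used only to make the sketch runnable.
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "car": (0.0, 1.0),
    "truck": (0.1, 0.9),
}


def toy_similarity(a: str, b: str) -> float:
    """Cosine similarity between two toy vectors."""
    va, vb = TOY_VECTORS[a], TOY_VECTORS[b]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (math.hypot(*va) * math.hypot(*vb))


if __name__ == "__main__":
    for word, (match, score) in best_matches(
        ["cat", "car"], ["dog", "truck"], toy_similarity
    ).items():
        print(word, "->", match, round(score, 3))
```

With a real model, you would presumably pass model.wv.similarity as the similarity argument, with train_corpus[i].words and train_corpus[j].words as the two word lists. Words the model never learned (for example, ones dropped by min_count) have no vector, so it is probably safest to filter each list with a check such as `w in model.wv` first.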

python

semantic-analysis

doc2vec