Problems accessing docvectors with gensim

Question

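I'm trying to use gensim (version 1.0.1) doc2vec to get the cosine similarity of documents. This should be relatively simple, but I'm having trouble retrieving the document vectors so that I can compute the cosine similarity. When I try to retrieve a document by the label I gave it during training, I get a KeyError.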

For example, print(model.docvecs['4_99.txt']) tells me there is no such key as 4_99.txt.

However, if I run print(model.docvecs.doctags), I see entries like this: '4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)

So it appears that, for each document, doc2vec is saving each sentence under a key of the form "document name, underscore, number".

So I'm either A) training incorrectly or B) not understanding how to retrieve the document vector so that I can call similarity(d1, d2).

Can anyone help me out here?

Here is how I trained my doc2vec model:

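import os
from random import shuffle

from gensim import utils
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence

# Obtain txt abstracts and txt patents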
filedir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
files = os.listdir(filedir)

#Doc2Vec takes [['a', 'sentence'], 'and label']
docLabels = [f for f in files if f.endswith('.txt')]

sources = {}  #{'2_139.txt': '2_139.txt'}
for lable in docLabels:
    sources[lable] = lable
sentences = LabeledLineSentence(sources)


model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())

model.save('./a2v.d2v')

It uses this class:

class LabeledLineSentence(object):

    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

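I got this class from a web tutorial (https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1) to help me get around Doc2Vec's odd data formatting requirements, and honestly I don't fully understand it. It looks like the class as written here appends _n to the label of each sentence, but in the tutorial they still seem to retrieve the document vector just by giving it the file name... so what am I doing wrong here?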

Answer 1

The gensim Doc2Vec class uses exactly the document 'tags' you pass it during training as the keys to the doc-vectors.

And yes, the LabeledLineSentence class is appending _n to the document tags. Specifically, those appear to be the line numbers within the associated files.

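So if what you really want is one vector per line, you'll have to request vectors using the same keys supplied during training, including the _n suffix.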
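As a rough sketch of what that lookup looks like, the snippet below uses a plain dict as a stand-in for model.docvecs, so it runs without gensim. The vectors are made-up numbers, and tags_for_file and cosine are hypothetical helper names, not gensim functions. It collects the per-line tags that belong to one file and compares two line vectors:

```python
import math

# Stand-in for model.docvecs: maps each training tag to its vector.
# The vector values are invented purely for illustration.
docvecs = {
    '4_99.txt_0': [1.0, 0.0],
    '4_99.txt_1': [0.0, 1.0],
    '2_139.txt_0': [1.0, 1.0],
}


def tags_for_file(all_tags, filename):
    """Return the per-line tags of `filename`, ordered by line number."""
    prefix = filename + '_'
    matches = [t for t in all_tags
               if t.startswith(prefix) and t[len(prefix):].isdigit()]
    return sorted(matches, key=lambda t: int(t[len(prefix):]))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The bare file name is not a key; only the suffixed tags are.
line_tags = tags_for_file(docvecs, '4_99.txt')
similarity = cosine(docvecs['4_99.txt_0'], docvecs['2_139.txt_0'])
```

With a real model, the tags to filter would come from model.docvecs.doctags, and each lookup would use the full suffixed key, for example model.docvecs['4_99.txt_3'].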

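If you instead want each file to be its own document, you'll need to change the corpus class to use the whole file as one document. Looking at the tutorial you reference, it appears to have a second LabeledLineSentence class that is not line-oriented (but is still named that way), and you aren't using that variation.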
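A minimal sketch of such a whole-file corpus might look like the following. This is not the tutorial's class: LabeledFileCorpus is a hypothetical name, and the local namedtuple is only a stand-in that mirrors the words/tags shape of gensim's TaggedDocument so the example runs without gensim installed. The demo file and its contents are invented.

```python
import os
import tempfile
from collections import namedtuple

# Local stand-in with a `words` field and a `tags` field, mirroring the
# shape of gensim's TaggedDocument (an assumption made for this sketch).
TaggedDocument = namedtuple('TaggedDocument', 'words tags')


class LabeledFileCorpus(object):
    """Hypothetical corpus class that yields one document per file."""

    def __init__(self, sources):
        # sources maps file path -> tag, e.g. {'2_139.txt': '2_139.txt'}
        self.sources = sources

    def __iter__(self):
        for path, tag in self.sources.items():
            with open(path, encoding='utf-8') as fin:
                # Read the whole file, so all lines share a single tag.
                words = fin.read().split()
            yield TaggedDocument(words, [tag])


# Small demo using a temporary file with two lines of invented text.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, '4_99.txt')
with open(path, 'w', encoding='utf-8') as fout:
    fout.write('first line of text\nsecond line\n')

docs = list(LabeledFileCorpus({path: '4_99.txt'}))
```

Because the tag is now the bare file name with no _n suffix, a model trained on a corpus like this would be keyed by '4_99.txt' directly.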

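Separately, there's no need to call train() multiple times in a loop or to adjust alpha manually. In any recent version of gensim, where train() already iterates over the corpus multiple times, that almost certainly isn't doing what you intend. In the latest versions of gensim, calling it that way will even raise an error, because so many outdated examples on the web have encouraged this mistake.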

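Just call train() once; it will make as many passes over your corpus as were specified when the model was constructed. (The default is 5, but it can be controlled with the iter initialization parameter, and 10 or more is common with Doc2Vec corpora.)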
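To see why the loop is not equivalent to a single call, here is a deliberately simplified model of the learning-rate schedule in plain Python. It is not gensim code: alpha_schedule is a hypothetical helper, it assumes a linear decay sampled once per pass, and the 0.025 and 0.0001 values are what I believe gensim's default alpha and min_alpha to be.

```python
def alpha_schedule(start_alpha, min_alpha, epochs):
    """Learning rate at the start of each pass, assuming linear decay."""
    step = (start_alpha - min_alpha) / epochs
    return [start_alpha - step * i for i in range(epochs)]


# One train() call on a model built with iter=10:
# ten passes with a single smooth decay.
single_call = alpha_schedule(0.025, 0.0001, 10)

# Ten train() calls in a loop on a model with the default iter=5:
# fifty passes in total, with the rate jumping back up on every call.
looped_calls = []
for _ in range(10):
    looped_calls.extend(alpha_schedule(0.025, 0.0001, 5))
```

Under these assumptions the loop does five times as many passes as intended, and the learning rate follows a sawtooth instead of a single decay, which is why setting iter and calling train() once is the safer choice.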

