Understanding Gensim Doc2vec ranking

I'm using gensim 4.0.1 and following tutorials 1 and 2:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

texts = [t.lower().split() for t in texts]

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
model = Doc2Vec(documents, epochs=50, vector_size=5, window=2, min_count=2, workers=4)

new_vector = model.infer_vector("human machine interface".split())


for rank, (doc_id, score) in enumerate(model.dv.most_similar_cosmul(positive=[new_vector]), start=1):
    print('{}. {:.5f} [{}] {}'.format(rank, score, doc_id, ' '.join(documents[doc_id].words)))


1. 0.56613 [7] graph minors iv widths of trees and well quasi ordering
2. 0.55941 [6] the intersection graph of paths in trees
3. 0.55061 [2] the eps user interface management system
4. 0.54981 [1] a survey of user opinion of computer system response time
5. 0.52249 [4] relation of user perceived response time to error measurement
6. 0.52240 [8] graph minors a survey
7. 0.49214 [0] human machine interface for lab abc computer applications
8. 0.49016 [3] system and human system engineering testing of eps
9. 0.47899 [5] the generation of random binary unordered trees

Why does document [0], which contains "human machine interface", rank so low (7th)? Is this a consequence of semantic generalization, or does the model need tuning? Is there a tutorial with a larger corpus I could use to get reproducible results?

The problem is the same one covered in my previous answer to a similar question:

Doc2Vec needs more data to begin working. Nine texts, with perhaps 55 words in total and roughly half of them unique, is far too small to show any interesting results from this algorithm.
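
One way to see this for yourself (a small diagnostic sketch of my own, assuming the `model` from the question is still in scope): infer the same query several times and watch the top hit move around. `infer_vector` starts from a randomized initialization on each call, and a model trained on this little data gives it almost nothing stable to converge on.

# Assumes `model` is the Doc2Vec model trained above on the 9 texts.
# infer_vector is randomized, and on so little data the resulting
# rankings are unstable from one call to the next.
query = "human machine interface".split()
for trial in range(3):
    vec = model.infer_vector(query)
    doc_id, score = model.dv.most_similar(positive=[vec], topn=1)[0]
    print('trial {}: top doc [{}] score {:.5f}'.format(trial, doc_id, score))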

Gensim's Doc2Vec-specific test cases and tutorials manage to squeeze some vaguely understandable similarities out of a 300-document test dataset (from the file lee_background.cor) where each document is only a few hundred words long - tens of thousands of words in all, several thousand of them unique. But that still requires reducing the dimensionality and increasing the epochs, and the results remain weak.
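
For reference, a minimal sketch of that setup, assuming the corpus file bundled with gensim's test data and roughly the parameter choices used in gensim's Lee-corpus tutorial (smaller vectors, more epochs):

from gensim import utils
from gensim.test.utils import datapath
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# lee_background.cor ships with gensim's test data:
# ~300 newswire documents, one per line
with open(datapath('lee_background.cor'), encoding='iso-8859-1') as f:
    train_corpus = [TaggedDocument(utils.simple_preprocess(line), [i])
                    for i, line in enumerate(f)]

# Reduced dimensionality and extra epochs, since even 300 documents
# is tiny by Doc2Vec standards
model = Doc2Vec(train_corpus, vector_size=50, min_count=2, epochs=40, workers=4)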

If you want to see meaningful results from Doc2Vec, you should aim for tens of thousands of documents, ideally with dozens or hundreds of words each.

Anything less than that will be disappointing, and won't be representative of the kinds of tasks this algorithm is designed to handle.
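
As one hedged illustration of that scale (my own example, not from the gensim docs): the 20 Newsgroups corpus, available through scikit-learn at roughly 19K posts, sits near the low end of that range:

from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# ~19K posts: still modest, but closer to a scale where Doc2Vec can work
newsgroups = fetch_20newsgroups(subset='all',
                                remove=('headers', 'footers', 'quotes'))
train_corpus = [TaggedDocument(simple_preprocess(text), [i])
                for i, text in enumerate(newsgroups.data)]

model = Doc2Vec(train_corpus, vector_size=100, min_count=5, epochs=20, workers=4)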

There is a tutorial using a larger movie-review dataset (100K documents), the same dataset used in the original 'Paragraph Vector' paper:

https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-auto-examples-howtos-run-doc2vec-imdb-py

There is also a tutorial based on Wikipedia (millions of documents), though it may now need some fixes to run:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb