Does a doc2vec model give accurate results on non-dictionary words?

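My corpus contains sentences with a mix of dictionary and non-dictionary words. The non-dictionary words matter just as much, because they are domain-specific terms. I have not applied any NLP preprocessing to the non-dictionary words. Does the doc2vec model compare a non-dictionary word against the same word when matching?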

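For example, I pass in ['AMDML', 'release'], where AMDML is a domain-specific word. If the trained model contains sentences such as ['AMDML', 'release', 'process'] or ['DML', 'release'], will it match on the identical word? Or will only words like 'release' and 'process' be matched by the most_similar method?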

I guess not.

According to the radimrehurek-gensim page, which cites the Le and Mikolov paper (they introduced the Doc2Vec algorithm), Gensim calls the Paragraph Vector model Doc2Vec:

In Gensim, we refer to the Paragraph Vector model as Doc2Vec. Which usually outperforms such simple-averaging of Word2Vec vectors. The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim’s Doc2Vec class implements this algorithm.
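To make the "floating word-like vector" idea concrete, here is a minimal sketch of how a doc-vector could be combined with context word vectors before a prediction is made. It assumes the averaging variant of the Paragraph Vector model; the function name `pv_dm_input` and the toy vectors are hypothetical and are not Gensim's implementation.

```python
def pv_dm_input(doc_vec, context_vecs):
    """Average the doc-vector together with the context word vectors.

    The doc-vector is treated like one more word vector, so it takes part
    in every prediction made for that document.
    """
    rows = [doc_vec] + context_vecs
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy 2-dimensional vectors, chosen only for illustration.
doc_vec = [1.0, 0.0]
context_vecs = [[0.0, 1.0], [2.0, 2.0]]
combined = pv_dm_input(doc_vec, context_vecs)
print(combined)
```

Every word vector in `context_vecs` has to come from a lookup table keyed by the exact token, which is why the vocabulary matters for the question here.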

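So I guess Doc2Vec just follows the Word2Vec model/algorithm. As far as I know, if a Word2Vec model has the word AMDML in its training corpus, it can produce a vector for it; otherwise it knows the word is missing and gives you something like an error: missing word, or at least returns a padding/empty vector.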
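The vocabulary behaviour described above can be sketched in plain Python. This is only an illustration of exact-token lookup in a word-level model, not Gensim's code; `build_vocab` and `split_known` are hypothetical helpers. The `min_count` parameter mirrors the Gensim setting of the same name, whose default of 5 would discard a domain word that appears only a few times.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}


def split_known(tokens, vocab):
    """Split tokens into those the model has a vector for and those it lacks."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown


train = [["AMDML", "release", "process"], ["DML", "release"]]

# With min_count=1 every training token is kept, so AMDML is matched exactly.
vocab = build_vocab(train, min_count=1)
known, unknown = split_known(["AMDML", "release"], vocab)

# With min_count=2 only 'release' survives; AMDML is dropped like an unseen word.
strict_vocab = build_vocab(train, min_count=2)
strict_known, strict_unknown = split_known(["AMDML", "release"], strict_vocab)
```

Note that 'AMDML' and 'DML' are separate vocabulary entries here: a word-level model sees no relation between them beyond what it learns from their contexts.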

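I think you need something like fastText. A fastText model always has a vector for any word, even one that never appeared in its training corpus. Unlike word2vec, fastText learns from the character n-grams of words, so you can find similar words by measuring their similarity values. After that, average these similarities over each sentence/doc to find similar sentences/docs.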
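A rough sketch of the character n-gram idea, followed by the per-sentence averaging step suggested above. It assumes n-gram lengths 3 to 6 with `<` and `>` as word boundary markers. The Jaccard overlap is a swapped-in stand-in for the cosine similarity you would get from real fastText vectors, which are built from learned n-gram embeddings; all function names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, wrapped in < and >."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_similarity(a, b):
    """Jaccard overlap of two words' n-gram sets (stand-in for vector cosine)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def sentence_score(query, sentence):
    """Average, over query words, of the best similarity to any sentence word."""
    best = [max(ngram_similarity(q, w) for w in sentence) for q in query]
    return sum(best) / len(best)


query = ["AMDML", "release"]

# n-grams that AMDML and DML have in common
shared = char_ngrams("AMDML") & char_ngrams("DML")

score_a = sentence_score(query, ["AMDML", "release", "process"])
score_b = sentence_score(query, ["DML", "release"])
print(sorted(shared), score_a, score_b)
```

Because 'AMDML' and 'DML' share some n-grams, the second sentence still gets partial credit for the domain word, which an exact-token vocabulary cannot give.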

python

gensim

doc2vec