Missing words in word2vec vocabulary

Question

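I am training word2vec on my own text corpus using Mikolov's implementation from here. Even though I set the minimum count to 1, not every unique word in the corpus gets a vector. Is there a parameter I have missed that could explain why some unique words get no vector? What else could be the cause?

To test word2vec's behaviour, I wrote the following script and fed it a text file containing 20058 sentences and 278896 words (all words and punctuation marks are separated by spaces, one sentence per line).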

import subprocess


def get_w2v_vocab(path_embs):
    vocab = set()
    with open(path_embs, 'r', encoding='utf8') as f:
        next(f)
        for line in f:
            word = line.split(' ')[0]
            vocab.add(word)
    return vocab - {'</s>'}


def train(path_corpus, path_embs):
    subprocess.call(["./word2vec", "-threads", "6", "-train", path_corpus,
                     "-output", path_embs, "-min-count", "1"])


def get_unique_words_in_corpus(path_corpus):
    vocab = []
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            vocab.extend(line.strip('\n').split(' '))
    return set(vocab)

def check_equality(expected, actual):
    if not expected == actual:
        diff = len(expected - actual)
        raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
    print('Expected vocab and actual vocab are equal.')



def main():
    path_corpus = 'test_corpus2.txt'
    path_embs = 'embeddings.vec'
    vocab_expected = get_unique_words_in_corpus(path_corpus)
    train(path_corpus, path_embs)
    vocab_actual = get_w2v_vocab(path_embs)
    check_equality(vocab_expected, vocab_actual)


if __name__ == '__main__':
    main()

The script produces the following output:

Starting training using file test_corpus2.txt
Vocab size: 33651
Words in train file: 298954
Alpha: 0.000048  Progress: 99.97%  Words/thread/sec: 388.16k  Traceback (most recent call last):
  File "test_w2v_behaviour.py", line 44, in <module>
    main()
  File "test_w2v_behaviour.py", line 40, in main
    check_equality(vocab_expected, vocab_actual)
  File "test_w2v_behaviour.py", line 29, in check_equality
    raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
Exception: Not equal! Vocab expected: 42116, Vocab actual: 33650, Diff: 17316

Answer 1

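As long as you are using Python, you may want to use the Word2Vec implementation in the gensim package. It does everything the original Mikolov/Google word2vec.c does, and more, and its performance is generally competitive.

In particular, it won't have any UTF-8 encoding issues, whereas I'm not sure that the Mikolov/Google word2vec.c handles UTF-8 correctly. That may be the source of your discrepancy.

If you need to get to the bottom of the discrepancy, I'd suggest:

Have your get_unique_words_in_corpus() also tally and report the total number of non-unique words its tokenization creates. If that differs from the 298954 reported by word2vec.c, then the two processes are clearly not working from the same baseline understanding of what counts as a 'word' in the source file.
Find some words, or at least one representative word, that your token count expects to appear in the final model but that does not. Look for any common characteristics among them, including their context in the file. That will probably reveal why the two counts differ.

Again, I suspect something UTF-8-related, or perhaps other implementation limits in word2vec.c (such as a maximum word length) that are not mirrored in your Python-based word counts.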
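The two checks above can be sketched in a few lines of standard-library Python. This is only an illustration: `tally_tokens` and `vocab_diff` are hypothetical helper names, and the in-memory `corpus_lines` and `model_vocab` stand in for the real corpus file and the vocabulary read from the embeddings file. One thing worth keeping in mind when comparing totals: 298954 - 20058 = 278896, so if word2vec.c counts one end-of-sentence token per line (an assumption, suggested by the `</s>` entry the script removes), the raw token totals may already agree and the difference would lie in how individual tokens are formed.

```python
from collections import Counter


def tally_tokens(lines):
    """Return (total token count, Counter of tokens) using the question's split(' ') rule."""
    counts = Counter()
    for line in lines:
        counts.update(line.strip('\n').split(' '))
    return sum(counts.values()), counts


def vocab_diff(expected, actual, sample=5):
    """Return small sorted samples of words that appear on only one side."""
    missing = sorted(expected - actual)[:sample]      # in corpus, not in model
    unexpected = sorted(actual - expected)[:sample]   # in model, not in corpus
    return missing, unexpected


# Tiny in-memory stand-ins for the corpus and the trained vocabulary.
corpus_lines = ["the cat sat\n", "the  dog sat\n"]  # note the double space
total, counts = tally_tokens(corpus_lines)
model_vocab = {"the", "cat", "dog", "sat"}

missing, unexpected = vocab_diff(set(counts), model_vocab)
print("total tokens:", total, "unique tokens:", len(counts))
print("only in corpus:", missing, "only in model:", unexpected)
```

Printing samples from both directions matters here, because the reported Diff (17316) is larger than the gap between the two vocabulary sizes, which suggests that some words in the model are also absent from the expected set.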
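To test the "implementation limits" hypothesis, one option is to tokenize the corpus in Python the way a byte-oriented C reader might and compare the result with the `split(' ')` rule from the question. The sketch below rests on assumptions that should be checked against the word2vec.c source you actually compiled: that separators are space, tab and newline, that carriage returns are dropped, and that over-long tokens are truncated. `c_style_tokens` and its `max_bytes` parameter are hypothetical, and the default of 98 is only a guess at the effective limit implied by the `MAX_STRING` constant.

```python
import io


def c_style_tokens(data: bytes, max_bytes: int = 98):
    """Tokenize raw bytes under the assumed C-reader rules described above."""
    data = data.replace(b"\r", b"")                       # assumed: CR is ignored
    data = data.replace(b"\t", b" ").replace(b"\n", b" ")  # assumed separators
    # Drop empty pieces (runs of separators) and truncate long tokens.
    return [raw[:max_bytes] for raw in data.split(b" ") if raw]


def python_tokens(text: str):
    """The question's rule: iterate over lines, split each on a single space."""
    tokens = []
    for line in io.StringIO(text):
        tokens.extend(line.strip("\n").split(" "))
    return tokens


# A small sample with a tab, a double space and an over-long token.
sample = "a\tb  c\n" + "x" * 120 + " d\n"

py = set(python_tokens(sample))
c = {t.decode("utf8", errors="replace") for t in c_style_tokens(sample.encode("utf8"))}

print("only in Python vocab:", sorted(py - c))
print("only in C-style vocab:", sorted(c - py))
```

If running something like this on the real corpus reproduces the mismatch, the words listed on each side should show which rule (tabs, repeated spaces, truncation, or a multi-byte character cut in half) is responsible.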

Answer 2

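You could use FastText instead of Word2Vec. FastText can embed out-of-vocabulary words by looking at subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:

from gensim.models import FastText

# training_data: an iterable of tokenized sentences, e.g. [['hello', 'world'], ...]
model = FastText(sentences=training_data)

word = 'blablabla'  # can be out of vocabulary
embedded_word = model.wv[word]  # fetches the word embedding (recent gensim versions expose vectors via .wv)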
