How to find the vocabulary size of a spaCy model?

Question

I am trying to find the vocabulary size of the large English model, i.e. en_core_web_lg, and I have found three different sources of information:

spaCy's documentation: 685k keys, 685k unique vectors
nlp.vocab.__len__(): 1340242 # (number of lexemes)
len(vocab.strings): 1476045

What is the difference between the three? I have not been able to find the answer in the documentation.

Answer 1

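The most useful numbers are the ones related to word vectors. nlp.vocab.vectors.n_keys tells you how many tokens have a word vector, and len(nlp.vocab.vectors) tells you how many unique word vectors there are (in the md models, multiple tokens can refer to the same word vector).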

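len(vocab) is the number of cached lexemes. In the md and lg models, most of the 1340242 lexemes have some precalculated features (like Token.prob), but the cache can also contain lexemes without precalculated features, because more entries can be added as you process texts.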

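len(vocab.strings) is the number of strings related to both tokens and annotations (like nsubj or NOUN), so it is not a particularly useful number. Every string used anywhere in training or processing is stored here so that the internal integer hashes can be converted back to strings when needed.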
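To compare these numbers side by side, something like the following sketch may help. The helper names vocab_stats and main are made up for illustration; the attributes they read (nlp.vocab.vectors.n_keys, len(nlp.vocab.vectors), len(nlp.vocab), len(nlp.vocab.strings)) are the ones discussed above. It assumes the model package is already installed.

```python
def vocab_stats(nlp):
    """Collect the four counts discussed above from a loaded spaCy pipeline.

    `vocab_stats` is a hypothetical helper, not part of spaCy itself.
    """
    vectors = nlp.vocab.vectors
    return {
        "words_with_vectors": vectors.n_keys,  # keys in the vector table
        "unique_vectors": len(vectors),        # rows in the vector table
        "cached_lexemes": len(nlp.vocab),      # can grow as text is processed
        "strings": len(nlp.vocab.strings),     # every string in the StringStore
    }


def main(model_name="en_core_web_lg"):
    # Imported here so vocab_stats can be reused without importing spaCy first.
    import spacy

    nlp = spacy.load(model_name)  # assumes the model package is installed
    for name, value in vocab_stats(nlp).items():
        print(f"{name}: {value}")
```

Calling main() prints one line per count. The exact values depend on the spaCy and model versions, so they will not necessarily match the numbers in the question.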

Answer 2

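As of spaCy 2.3+, according to the release notes, the lexemes are no longer loaded into nlp.vocab on initialization, so len(nlp.vocab) is not a reliable measure. Instead, use nlp.meta['vectors'] to find the number of unique vectors and the number of words with vectors. Here is the relevant section from the release notes: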

To reduce the initial loading time, the lexemes in nlp.vocab are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.

To see the number of unique vectors and number of words with vectors, see nlp.meta['vectors'], for example for en_core_web_md there are 20000 unique vectors and 684830 words with vectors:
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
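As a rough sketch of reading those two numbers programmatically: vector_counts below is a hypothetical helper, and example_meta simply copies the en_core_web_md values quoted above into the shape of nlp.meta. With a loaded pipeline you would pass nlp.meta instead.

```python
def vector_counts(meta):
    """Return (unique_vectors, words_with_vectors) from a pipeline's meta dict.

    `vector_counts` is a hypothetical helper; missing entries fall back to 0.
    """
    info = meta.get("vectors", {})
    return info.get("vectors", 0), info.get("keys", 0)


# Stand-in for nlp.meta, using the en_core_web_md values from the release notes.
example_meta = {
    "vectors": {
        "width": 300,
        "vectors": 20000,
        "keys": 684830,
        "name": "en_core_web_md.vectors",
    }
}

unique, keyed = vector_counts(example_meta)
print(f"{unique} unique vectors, {keyed} words with vectors")
# prints: 20000 unique vectors, 684830 words with vectors
```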
