spaCy: what is the NORM part of tokenizer_exceptions?

Question

I'm adding tokenizer_exceptions for my language. I was looking at the 'gonna' example for English, so I wrote the following rule:

'т.п.': [
    {ORTH: "т.", NORM: "тому", LEMMA: "тот"},
    {ORTH: "п.", NORM: "подобное", LEMMA: "подобный"}
],

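Then, when I tokenize, I expect the NORM part of the rule to show up in token.norm_ (even though there is no documentation for Token.norm_). Instead, I see the ORTH part in token.norm_, and I can't find the NORM part of the rule anywhere on the token instance.

So what is the Token.norm_ member, and what is the NORM part of a tokenizer_exceptions rule?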

Answer 1

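You can bind any function in nlp.vocab.lex_attr_getters, and it will be computed for that vocabulary entry. Every token holds a pointer to its vocabulary item, so they all refer to this computed value.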

import spacy
from spacy.attrs import NORM

nlp = spacy.blank('ru') # In spacy 1, just spacy.load() here.

doc = nlp(u'a a b c b')

[(w.norm_, w.text) for w in doc]
# (a, a), (a, a), (b, b), (c, c), (b, b)

nlp.vocab.lex_attr_getters[NORM] = lambda string: string.upper()
# This part should be done automatically, but isn't yet.
for lexeme in nlp.vocab:
    lexeme.norm_ = nlp.vocab.lex_attr_getters[NORM](lexeme.orth_)
[(w.norm_, w.text) for w in doc]
# (A, a), (A, a), (B, b), (C, c), (B, b)

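You can bind these lexical attributes to whatever you like. I'm not sure how the bindings work for Russian, but you can either change them in the source code or change them at runtime by resetting the lexical attribute functions.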
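To make the pointer relationship concrete, here is a toy model in plain Python. It is only a sketch and not spaCy's implementation: ToyLexeme, ToyToken, ToyVocab and set_norm_getter are hypothetical names invented for illustration. It shows why recomputing the norm on the vocabulary entries changes what every existing token reports.

```python
class ToyLexeme:
    """One vocabulary entry, shared by every token with the same text (hypothetical)."""

    def __init__(self, orth, norm):
        self.orth = orth
        self.norm = norm


class ToyToken:
    """A token stores no norm of its own; it reads it from its lexeme (hypothetical)."""

    def __init__(self, lexeme):
        self.lexeme = lexeme

    @property
    def text(self):
        return self.lexeme.orth

    @property
    def norm(self):
        return self.lexeme.norm


class ToyVocab:
    """Maps each distinct string to a single shared ToyLexeme (hypothetical)."""

    def __init__(self, norm_getter):
        self.norm_getter = norm_getter
        self.lexemes = {}

    def get(self, orth):
        # Create the entry on first sight, computing its norm once.
        if orth not in self.lexemes:
            self.lexemes[orth] = ToyLexeme(orth, self.norm_getter(orth))
        return self.lexemes[orth]

    def set_norm_getter(self, norm_getter):
        # Swap the getter and recompute the norm of every existing entry.
        self.norm_getter = norm_getter
        for lexeme in self.lexemes.values():
            lexeme.norm = norm_getter(lexeme.orth)


vocab = ToyVocab(norm_getter=str.lower)
tokens = [ToyToken(vocab.get(word)) for word in "a a b c b".split()]
before = [(t.norm, t.text) for t in tokens]

vocab.set_norm_getter(str.upper)
after = [(t.norm, t.text) for t in tokens]

print(before)
print(after)
```

The tokens are never rebuilt; only the shared entries change, which is the same reason the loop over nlp.vocab in the snippet above affects a doc that was already created.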

Answer 2

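To answer the question more generally: in spaCy v1.x, NORM is mostly used to supply a "normalised" form of the token, for example when the token text is "incomplete" (as in the gonna example) or is an alternate spelling. The main purpose of the norm in v1.x is to make it accessible as the .norm_ attribute for future reference.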

However, in v2.x, currently in alpha, the NORM attribute becomes more relevant, as it's also used as a feature in the model. This lets you normalise words with different spellings to one common spelling and ensures that those words receive similar representations, even if one of them is less frequent in your training data. Examples of this are American vs. British spelling in English, or currency symbols, which are all normalised to $. To make this easier, v2.0 introduces a new language data component, the norm exceptions.
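As a rough illustration of the idea behind norm exceptions, here is a plain-Python sketch. It does not use spaCy and is not how spaCy stores or applies its data: the NORM_EXCEPTIONS table, its entries and the toy_norm function are assumptions made up for this example. It only shows the lookup pattern, where a variant is mapped to one common form and everything else falls back to a default normalisation.

```python
# Hypothetical table of norm exceptions: variant spelling -> common form.
NORM_EXCEPTIONS = {
    "colour": "color",
    "realise": "realize",
    "€": "$",
    "£": "$",
}


def toy_norm(text, exceptions=NORM_EXCEPTIONS):
    """Return the common form if one is listed, else the lowercased text."""
    lowered = text.lower()
    return exceptions.get(lowered, lowered)


words = ["Colour", "color", "€", "$", "spaCy"]
norms = [toy_norm(word) for word in words]
print(norms)
```

Because "Colour" and "color" end up with the same norm, a model that uses the norm as a feature would treat them alike, even if one spelling is rare in the training data.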

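If you're working on your own language models, I'd definitely recommend checking out the v2.0 alpha (which is now pretty close to a first release candidate).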
