spaCy Documentation for [ orth, pos, tag, lemma and text ]

Question

I am new to spaCy. I am adding this post as documentation, to keep things simple for newcomers like me.

import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)

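I would like to understand what orth, lemma, tag and pos mean. The code above prints out their values. I would also like to know the difference between print(word) and print(word.orth_).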

Answer 1

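1) When you print word, you are basically printing the Token class from spaCy, which is set up to print out a string from the class. You can see more here. So it is different from printing word.orth_ or word.text, which are already strings and are printed directly.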

2) I am not sure about word.orth_; in most cases it seems to be the same as word.text. As for word.lemma_, it is the lemmatized form of the given word, e.g. is, am and are all map to be in word.lemma_.
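To make the lemma idea concrete, here is a minimal pure-Python sketch. The table `TOY_LEMMAS` and the helper `toy_lemma` are hypothetical names made up for illustration; spaCy's real lemmatizer is more involved and is not used here.

```python
# Hypothetical lookup table: inflected form -> lemma (illustration only,
# this is not how spaCy stores or computes lemmas).
TOY_LEMMAS = {"is": "be", "am": "be", "are": "be", "rocks": "rock"}


def toy_lemma(word):
    """Return the toy lemma for a word, falling back to the lowercased word."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


for w in ["is", "am", "are", "Rocks", "calm"]:
    print(w, "->", toy_lemma(w))
```

Several surface forms collapse onto one lemma, which is the behaviour described for word.lemma_.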

Answer 2

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes

What is the difference between print(word) and print(word.orth_)?

In short:

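word.orth_ and word.text are the same. The trailing underscore on the Cython property is spaCy's naming convention for the string form of an attribute: word.orth is an integer ID, and word.orth_ is the string that ID maps to.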

In brief:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks the word up in the index that holds the whole vocabulary:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

(See the "In long" part below for an explanation of self.c.lex.orth.)

And word.text returns the string representation of the word by simply wrapping the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

property text:
    def __get__(self):
        return self.orth_
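The relationship between the two properties can be imitated in plain Python. In this sketch, `ToyToken` and its `strings` table are hypothetical stand-ins, not spaCy classes; the only point is that `text` delegates to `orth_`, so the two cannot disagree.

```python
class ToyToken:
    """Hypothetical stand-in for a token; not spaCy's Token class."""

    def __init__(self, strings, orth_id):
        self._strings = strings  # maps integer ID -> string
        self.orth = orth_id      # integer ID, in the spirit of word.orth

    @property
    def orth_(self):
        # String form of the integer ID, in the spirit of word.orth_
        return self._strings[self.orth]

    @property
    def text(self):
        # text simply wraps orth_, mirroring the snippet above
        return self.orth_


strings = {0: "KEEP", 1: "CALM"}
tok = ToyToken(strings, 1)
print(tok.orth, tok.orth_, tok.text)  # 1 CALM CALM
```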

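When you call print(word), Python calls the __str__ dunder function (__repr__ delegates to it as well), which returns word.__unicode__ or word.__bytes__, both of which point to the word.text variable; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55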

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
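The same delegation can be sketched in pure Python, assuming Python 3 so that the `__unicode__`/`__bytes__` branch is dropped. `PrintableToken` is a hypothetical class, not spaCy's; it shows that `print` and `str` go through `__str__`, that `repr` ends up at the same string, and that the object itself is still not a `str`.

```python
class PrintableToken:
    """Hypothetical stand-in for a token; not spaCy's Token class."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print(obj) and str(obj) call this
        return self.text

    def __repr__(self):
        # repr(obj) and the interactive prompt call this; delegate to __str__
        return self.__str__()


tok = PrintableToken("Rock")
print(tok)                      # prints Rock
print(tok.text)                 # prints Rock
print(str(tok) == tok.text)     # True
print(isinstance(tok, str))     # False: tok is an object, tok.text is a string
```

So the printed output matches, but only word.text (or word.orth_) gives you an actual string to work with.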

In long:

Let's try to walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a series of Cython properties defined, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth:
    def __get__(self):
        return self.c.lex.orth

Tracing it back, we see self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

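Without thorough documentation, we don't really know what self.c means, but from the looks of it, it is accessing one of the tokens through the &self.doc reference, which points to the Doc doc passed into the __cinit__ function. So most probably it is a shortcut for accessing the tokens.

Looking at Doc.c: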

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

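Now we see that Doc.c refers to data_start, a Cython pointer to the memory allocated to store the TokenC structs of the spacy.tokens.doc.Doc object (please correct me if I have misread <TokenC*>).

So going back to self.c = &self.doc.c[offset], it is basically trying to access the memory location where the array is stored, and more specifically the "offset-th" item in the array.

That is what spacy.tokens.token.Token is.

Back to the property: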
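The "view" idea can be sketched without Cython. Below, `ToyDoc` owns a list of records and `ToyTokenView` only remembers the document and an offset, loosely mirroring self.c = &self.doc.c[offset]. Both class names and the dict-based records are assumptions made for illustration; the real code works with C structs and pointers.

```python
class ToyDoc:
    """Hypothetical document that owns the per-token records."""

    def __init__(self, words):
        # One record per token, playing the role of the TokenC array
        self.c = [{"orth": w, "length": len(w)} for w in words]

    def __len__(self):
        return len(self.c)

    def __getitem__(self, i):
        # Build a lightweight view on demand; no data is copied
        return ToyTokenView(self, i)


class ToyTokenView:
    """Hypothetical token that is only a view into ToyDoc.c."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset

    @property
    def c(self):
        # Rough analogue of &self.doc.c[offset]
        return self.doc.c[self.i]

    @property
    def text(self):
        return self.c["orth"]


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
tok = doc[3]
print(tok.text)                # foo
doc.c[3]["orth"] = "baz"       # change the data owned by the document
print(tok.text)                # baz: the view sees the change
```

Because the token holds no data of its own, any change to the document's array shows up through the token immediately.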

property orth:
    def __get__(self):
        return self.c.lex.orth

We see that self.c.lex is accessing data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that indicates the index of the word as kept in the internal vocabulary.

Thus, we see that the property orth_ tries to access self.vocab.strings with the index from self.c.lex.orth; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
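The integer-to-string lookup behind orth_ can be imitated with a tiny interning table. `ToyStringStore` is a hypothetical class, and handing out sequential integers is an assumption of this sketch; it is not a description of how vocab.strings assigns its IDs.

```python
class ToyStringStore:
    """Hypothetical two-way table between strings and integer IDs."""

    def __init__(self):
        self._id_to_str = []
        self._str_to_id = {}

    def add(self, string):
        # Return the existing ID, or assign the next free one
        if string not in self._str_to_id:
            self._str_to_id[string] = len(self._id_to_str)
            self._id_to_str.append(string)
        return self._str_to_id[string]

    def __getitem__(self, key):
        # Integer -> string, string -> integer
        if isinstance(key, int):
            return self._id_to_str[key]
        return self._str_to_id[key]


strings = ToyStringStore()
orth_ids = [strings.add(w) for w in "We Rock and We Roll".split()]
print(orth_ids)                          # [0, 1, 2, 0, 3]
print([strings[i] for i in orth_ids])    # ['We', 'Rock', 'and', 'We', 'Roll']
```

Each token only needs to carry the integer; the string is recovered from the shared table on demand, which is the pattern the orth_ property follows.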
