spaCy documentation for [orth, pos, tag, lemma and text]
I am new to spaCy. I added this post as documentation, to keep things simple for newcomers like me.
import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)
I would like to understand what orth, lemma, tag and pos mean. The code above prints their values. Also, what is the difference between print(word) and print(word.orth_)?
1) When you print word, you are basically printing the Token class from spaCy, which is set up to print out a string from the class. You can see more here. So it is different from printing word.orth_ or word.text, which print the string directly.
2) I am not sure about word.orth_; in most cases it seems to be the same as word.text. As for word.lemma_, it is the lemmatized form of the given word, e.g. is, am and are all map to be in word.lemma_.
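As a rough illustration of what "map to be" means, here is a minimal sketch that uses a plain dictionary. TOY_LEMMAS and toy_lemma are hypothetical names made up for this example; this is not how spaCy's lemmatizer works internally.

```python
# Toy lookup table: inflected form -> base form (lemma).
# Hypothetical data for illustration only, not spaCy's lemmatizer.
TOY_LEMMAS = {"is": "be", "am": "be", "are": "be"}


def toy_lemma(word):
    """Return the base form if it is in the table, else the lowercased word."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


lemmas = [toy_lemma(w) for w in ["is", "am", "are", "Rock"]]
print(lemmas)
```

Words missing from the table simply fall back to their lowercased form, which is a simplification chosen for this sketch.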
What is the meaning of orth, lemma, tag and pos?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
What is the difference between print(word) and print(word.orth_)?
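In short:
word.orth_ and word.text are the same. In spaCy, a property whose name ends in an underscore is the string form of an attribute whose plain name returns an integer ID: word.orth is an integer and word.orth_ is the corresponding string (the same holds for lemma/lemma_, tag/tag_ and pos/pos_ in the code above).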
In less short:
When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks up an index in the store where all the vocabulary strings are kept:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
(see the "In long" explanation below for details on self.c.lex.orth)
And word.text returns the string representation of the word, simply wrapping the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128
property text:
    def __get__(self):
        return self.orth_
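The two properties above can be mimicked in plain Python. This is only a sketch: ToyStringStore and ToyToken are hypothetical stand-ins for vocab.strings and Token, and the sequential integer IDs are an assumption made for illustration.

```python
class ToyStringStore:
    """Hypothetical stand-in for vocab.strings: interns strings as integer IDs."""

    def __init__(self):
        self._id_to_str = []
        self._str_to_id = {}

    def add(self, string):
        # Reuse the existing ID if the string was seen before.
        if string not in self._str_to_id:
            self._str_to_id[string] = len(self._id_to_str)
            self._id_to_str.append(string)
        return self._str_to_id[string]

    def __getitem__(self, index):
        return self._id_to_str[index]


class ToyToken:
    """Hypothetical token exposing an integer ID and its string form."""

    def __init__(self, strings, string):
        self._strings = strings
        self.orth = strings.add(string)  # integer ID

    @property
    def orth_(self):
        # String form: look the integer ID up in the string store.
        return self._strings[self.orth]

    @property
    def text(self):
        # text simply wraps orth_.
        return self.orth_


strings = ToyStringStore()
tokens = [ToyToken(strings, w) for w in ["KEEP", "CALM", "KEEP"]]
for tok in tokens:
    print(tok.orth, tok.orth_, tok.text)
```

Both occurrences of KEEP end up with the same integer ID, and text always equals orth_.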
When you call print(word), Python calls the __str__ dunder function (which __repr__ also delegates to). It returns word.__unicode__ or word.__bytes__, and both point to the word.text variable; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
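The delegation chain in that class can be reproduced in a few lines of plain Python 3. ToyToken is a hypothetical class written for this sketch; it only imitates the __str__ and __repr__ methods shown above.

```python
class ToyToken:
    """Hypothetical token: __repr__ delegates to __str__, which returns text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        return self.__str__()


word = ToyToken("Rock")
print(word)    # print() calls str(word), i.e. word.__str__()
print([word])  # containers use repr() for their elements

printed = str(word)
shown = repr(word)
```

Because both methods end up returning the text, the token looks like a bare string whether it is printed directly or shown inside a list.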
In long:
Let's try to walk through this step by step:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>
After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:
cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.

    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.

    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')

    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """
So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a whole series of cython properties defined, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162
property orth:
    def __get__(self):
        return self.c.lex.orth
Tracing it back, we see self.c = &self.doc.c[offset]:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset
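Without thorough documentation, we don't really know what self.c means, but from the looks of it, it is accessing one of the tokens through the &self.doc reference, which points to the Doc doc that was passed into the __cinit__ function. So most probably it is a shortcut for accessing the tokens.
Looking at Doc.c: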
cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING
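The padding idea in that constructor (allocate extra slots on both sides, then point c at the first real slot) can be sketched in plain Python. Everything here is an assumption made for illustration: ToyPaddedArray is a hypothetical class, the PADDING value of 2 is made up because the real constant is not shown in the excerpt, and the "<EMPTY>" sentinel stands in for EMPTY_LEXEME.

```python
PADDING = 2  # hypothetical value; the real constant is not shown in the excerpt


class ToyPaddedArray:
    """Hypothetical padded buffer: index 0 sits PADDING slots into the storage."""

    def __init__(self, size, empty="<EMPTY>"):
        # Allocate size + 2 * PADDING slots, all pre-filled with a sentinel.
        self._data = [empty] * (size + PADDING * 2)

    def _check(self, i):
        # Unlike a Python list, -1 here means "one slot before the start".
        if not -PADDING <= i < len(self._data) - PADDING:
            raise IndexError(i)
        return i + PADDING

    def __getitem__(self, i):
        return self._data[self._check(i)]

    def __setitem__(self, i, value):
        self._data[self._check(i)] = value


arr = ToyPaddedArray(size=3)
arr[0] = "This"
arr[1] = "is"
before_start = arr[-1]  # falls in the left padding
past_end = arr[3]       # falls in the right padding
print(before_start, arr[0], arr[1], past_end)
```

Reading slightly before the start or past the end returns the sentinel instead of touching invalid memory, which is the guarantee the comment in the constructor describes.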
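Now we see that Doc.c refers to a pointer into data_start, an array allocated to hold the TokenC structs for the spacy.tokens.doc.Doc object (please correct me if I got the explanation of <TokenC*> wrong).
So going back to self.c = &self.doc.c[offset], it is basically accessing the place in memory where the array is stored, and more specifically the "offset-th" item in that array.
And that is what a spacy.tokens.token.Token is.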
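To make the "view into the array" idea concrete, here is a minimal sketch in plain Python. ToyDoc and ToyToken are hypothetical names, and a list of dictionaries stands in for the array of TokenC structs; the point is only that the token keeps a reference to the doc's record rather than its own copy.

```python
class ToyToken:
    """Hypothetical Token: a view that stores only the doc and an offset."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        self.c = doc.c[offset]  # reference to the doc's record, not a copy

    @property
    def text(self):
        return self.c["orth_"]


class ToyDoc:
    """Hypothetical Doc: owns the per-token records."""

    def __init__(self, words):
        # Stands in for the TokenC array held by the real Doc.
        self.c = [{"orth_": w} for w in words]

    def __getitem__(self, offset):
        return ToyToken(self, offset)


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
token = doc[3]
before = token.text
doc.c[3]["orth_"] = "baz"  # change the data held by the doc
after = token.text
print(before, after)
```

The token sees the change immediately, because it never owned the data in the first place.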
Going back to the property:
property orth:
    def __get__(self):
        return self.c.lex.orth
We see that self.c.lex is accessing data_start[i].lex from the spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that represents the index of the word as kept in the internal vocabulary.
Thus, we see that the property orth_ tries to access self.vocab.strings with the index from self.c.lex.orth (https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162):
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
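Putting the whole chain together, the lookup self.vocab.strings[self.c.lex.orth] can be sketched as follows. All class names here (ToyLexeme, ToyTokenC, ToyVocab, ToyToken) are hypothetical, and using list positions as integer IDs is an assumption made for this sketch, not a statement about how spaCy assigns them.

```python
from dataclasses import dataclass


@dataclass
class ToyLexeme:
    orth: int  # integer ID of the word's string


@dataclass
class ToyTokenC:
    lex: ToyLexeme  # each token record points at a lexeme


class ToyVocab:
    """Hypothetical vocab: a string list plus one shared lexeme per string."""

    def __init__(self):
        self.strings = []   # integer ID -> string
        self._lexemes = {}  # string -> shared ToyLexeme

    def get(self, string):
        if string not in self._lexemes:
            self.strings.append(string)
            self._lexemes[string] = ToyLexeme(orth=len(self.strings) - 1)
        return self._lexemes[string]


class ToyToken:
    def __init__(self, vocab, records, offset):
        self.vocab = vocab
        self.c = records[offset]

    @property
    def orth(self):
        return self.c.lex.orth

    @property
    def orth_(self):
        return self.vocab.strings[self.c.lex.orth]


vocab = ToyVocab()
words = ["This", "is", "a", "foo", "bar", "sentence", "."]
records = [ToyTokenC(lex=vocab.get(w)) for w in words]
token = ToyToken(vocab, records, 3)
print(token.orth, token.orth_)
```

Here token.orth is just a number, and token.orth_ turns it back into the string by indexing into the vocab's string list.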