Provoke the NLTK part-of-speech tagger to report a plural proper noun
Let's try out the famous part-of-speech tagger in Python's nltk package.
import nltk
# You might also need to run nltk.download('maxent_treebank_pos_tagger')
# even after installing nltk
string = 'Buddy Billy went to the moon and came Back with several Vikings.'
nltk.pos_tag(nltk.word_tokenize(string))
This gives me
[('Buddy', 'NNP'), ('Billy', 'NNP'), ('went', 'VBD'), ('to', 'TO'),
('the', 'DT'), ('moon', 'NN'), ('and', 'CC'), ('came', 'VBD'),
('Back', 'NNP'), ('with', 'IN'), ('several', 'JJ'), ('Vikings',
'NNS'), ('.', '.')]
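You can interpret the codes here. I am a bit disappointed that 'Back' got classified as a proper noun (NNP), although the confusion is understandable. I am more upset that 'Vikings' got called a simple plural noun (NNS) rather than a plural proper noun (NNPS). Can anyone come up with a single example of a brief input that leads to at least one NNPS tag?
There seems to be some quirk with the tags in the NLTK Brown corpus, which labels NNPS as NPS (possibly the NLTK tagset is an updated/outdated tagset that differs from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
Here is an example of plural proper nouns: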
>>> from nltk.corpus import brown
>>> for sent in brown.tagged_sents():
...     if any(pos == 'NPS' for word, pos in sent):
...         print(sent)
...         break
...
[(u'Georgia', u'NP'), (u'Republicans', u'NPS'), (u'are', u'BER'), (u'getting', u'VBG'), (u'strong', u'JJ'), (u'encouragement', u'NN'), (u'to', u'TO'), (u'enter', u'VB'), (u'a', u'AT'), (u'candidate', u'NN'), (u'in', u'IN'), (u'the', u'AT'), (u'1962', u'CD'), (u"governor's", u'NN$'), (u'race', u'NN'), (u',', u','), (u'a', u'AT'), (u'top', u'JJS'), (u'official', u'NN'), (u'said', u'VBD'), (u'Wednesday', u'NR'), (u'.', u'.')]
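The NP and NPS labels in that output are the Brown tagset's proper-noun tags, which correspond to NNP and NNPS in the Penn Treebank tagset. A minimal sketch of translating just those two tags, assuming nothing else needs converting; `BROWN_TO_PENN_PROPER` and `to_penn_proper` are hypothetical names for illustration, not part of NLTK:

```python
# Hypothetical helper, not part of NLTK: translate only the two Brown
# proper-noun tags to their Penn Treebank counterparts.
BROWN_TO_PENN_PROPER = {"NP": "NNP", "NPS": "NNPS"}


def to_penn_proper(tagged_sent):
    """Rewrite NP/NPS tags in a list of (word, tag) pairs; leave other tags alone."""
    return [(word, BROWN_TO_PENN_PROPER.get(tag, tag)) for word, tag in tagged_sent]


# First three tokens of the Brown sentence shown above.
brown_sample = [("Georgia", "NP"), ("Republicans", "NPS"), ("are", "BER")]
print(to_penn_proper(brown_sample))
# [('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'BER')]
```

Other Brown tags (such as BER or AT) are passed through unchanged here, so this is not a full tagset conversion.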
But if you tag it with nltk.pos_tag, you get NNPS:
>>> for sent in brown.tagged_sents():
...     if any(pos == 'NPS' for word, pos in sent):
...         print(" ".join([word for word, pos in sent]))
...         break
...
Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .
>>> from nltk import pos_tag
>>> pos_tag("Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .".split())
[('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'), ('getting', 'VBG'), ('strong', 'JJ'), ('encouragement', 'NN'), ('to', 'TO'), ('enter', 'VB'), ('a', 'DT'), ('candidate', 'NN'), ('in', 'IN'), ('the', 'DT'), ('1962', 'CD'), ("governor's", 'NNS'), ('race', 'NN'), (',', ','), ('a', 'DT'), ('top', 'JJ'), ('official', 'NN'), ('said', 'VBD'), ('Wednesday', 'NNP'), ('.', '.')]
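To hunt for other inputs that trigger the tag, one could loop over candidate sentences and keep the first one where the tagger emits NNPS. The sketch below is only an illustration: `words_with_tag` and `first_sentence_with_tag` are hypothetical helpers, and the tagger is passed in as a plain callable so the code runs without any NLTK data installed.

```python
def words_with_tag(tagged_sent, tag):
    """Return the words in a list of (word, tag) pairs that carry `tag`."""
    return [word for word, t in tagged_sent if t == tag]


def first_sentence_with_tag(sentences, tagger, tag="NNPS"):
    """Return (tokens, matching_words) for the first token list where
    `tagger(tokens)` produces `tag`, or None if no sentence qualifies."""
    for tokens in sentences:
        hits = words_with_tag(tagger(tokens), tag)
        if hits:
            return tokens, hits
    return None


# Part of the pos_tag output shown above, reused as literal data.
tagged = [("Georgia", "NNP"), ("Republicans", "NNPS"), ("are", "VBP")]
print(words_with_tag(tagged, "NNPS"))
# ['Republicans']

# With NLTK and its data installed, the search might be run as:
#   import nltk
#   from nltk.corpus import brown
#   first_sentence_with_tag(brown.sents(), nltk.pos_tag)
```

Passing the tagger as an argument also makes it easy to compare different taggers on the same candidate sentences.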