Print pos tag with removed adjectives (NLTK)
abc = nltk.pos_tag(info)
print(s for s in abc if s[1] != 'ADV')
Returns: <generator object pos.<locals>.<genexpr> at 0x000000000E000D00>
If I use [] around the print, I get "Invalid syntax".
For adjectives, try this:
abc = nltk.pos_tag(info)
print([s for s in abc if s[1] != 'JJ'])
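I'm guessing you just want the POS output for everything that is not an adverb?
Using parentheses alone passes the print function a generator expression, which is why you see a generator object instead of its contents. If you just want to output everything at once, try something like this (a list comprehension):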
print([s for s in abc if s[1] != 'ADV'])
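Note: in an interactive interpreter you can get the same output without print(), by evaluating the list comprehension on its own.
Also, FYI: last I checked, "ADV" is not a tag in the Penn Treebank tagset that pos_tag uses by default. If you want to eliminate adverbs, I believe the correct adverb tags are "RB", "RBR" and "RBS".
Updated the answer following alexis's reply below. He is right that the explanation was incomplete. Pasting his comment: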
There's generators, and there's list comprehensions. print(s for s
...) passes print a generator; the version with square brackets uses
the generator in a list comprehension, to make a list.
(Please upvote alexis's comment as well.)
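To make the difference concrete, here is a small sketch that runs without NLTK. The tagged list is hand-written to stand in for pos_tag() output, and the names lazy, kept and from_lazy are made up for illustration.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag() output (assumption).
tagged = [("I", "PRP"), ("am", "VBP"), ("running", "VBG"), ("quickly", "RB")]

# Parentheses alone build a generator expression: nothing has been filtered yet.
lazy = (pair for pair in tagged if pair[1] != "RB")
print(lazy)  # shows a generator object, not the filtered pairs

# Square brackets build the list immediately, so print shows the contents.
kept = [pair for pair in tagged if pair[1] != "RB"]
print(kept)

# A generator can also be consumed explicitly, for example with list().
from_lazy = list(lazy)
```

Both kept and from_lazy end up holding the same pairs; the only difference is when the filtering actually happens.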
From https://github.com/nltk/nltk/issues/1783#issuecomment-317174189:
The pos_tag() function is trained on Sections 00-18 of the Wall Street Journal sections of OntoNotes 5.
From http://www.nltk.org/api/nltk.tag.html#module-nltk.tag:
It uses the Penn TreeBank tagset https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
To catch all adverbs, check for the RB* tags.
Using a list comprehension, check whether the first 2 characters of the tag are RB, e.g.
>>> from nltk import pos_tag, word_tokenize
>>> sent = "I am running quickly"
>>> [word for word, pos in pos_tag(word_tokenize(sent)) if pos.startswith('RB')]
['quickly']
To catch adjectives, check for the JJ* tags:
>>> sent = "The big red cat is redder than apple"
>>> [word for word, pos in pos_tag(word_tokenize(sent)) if pos.startswith('JJ')]
['big', 'red', 'redder']
If you check only for the exact tag JJ and not for JJ* (i.e. .startswith('JJ')), you will miss the comparative and superlative adjectives:
>>> sent = "The big red cat is redder than apple, it's the best in the world"
>>> [word for word, pos in pos_tag(word_tokenize(sent)) if pos.startswith('JJ')]
['big', 'red', 'redder', 'best']
>>> [word for word, pos in pos_tag(word_tokenize(sent)) if pos == 'JJ' ]
['big', 'red']
To remove them, just use not:
>>> [word for word, pos in pos_tag(word_tokenize(sent)) if not pos.startswith('JJ')]
['The', 'cat', 'is', 'than', 'apple', ',', 'it', "'s", 'the', 'in', 'the', 'world']
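If you want to drop adjectives and adverbs in one pass, one option is to rely on str.startswith accepting a tuple of prefixes. This is only a sketch: drop_tags is a hypothetical helper, not part of NLTK, and the tagged list is hand-written in place of real pos_tag() output so the snippet runs on its own.

```python
def drop_tags(tagged, prefixes):
    """Return the (word, tag) pairs whose tag starts with none of the prefixes."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [(word, tag) for word, tag in tagged if not tag.startswith(prefixes)]


# Hand-written tags standing in for pos_tag() output (assumption).
tagged = [
    ("The", "DT"), ("big", "JJ"), ("cat", "NN"),
    ("runs", "VBZ"), ("very", "RB"), ("quickly", "RB"),
]

# Drop every JJ* (adjective) and RB* (adverb) tag, keeping the pairs intact.
result = drop_tags(tagged, ("JJ", "RB"))
print(result)
```

Keeping the (word, tag) pairs rather than only the words matches what the question asked for, namely printing the pos tags with some categories removed.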