Part of speech tagging and entity recognition - python

Question

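I want to perform part-of-speech tagging and entity recognition in Python, similar to the Maxent_POS_Tag_Annotator and Maxent_Entity_Annotator functions of openNLP in R. I would prefer code in Python that takes a text sentence as input and returns the output as separate features, for example the number of "CC" tags, the number of "CD" tags, the number of "DT" tags, and so on. CC, CD and DT are POS tags as used in the Penn Treebank. So there should be 36 columns/features for POS tagging, corresponding to the 36 POS tags in the Penn Treebank tagset. I want to implement this in the Azure ML "Execute Python Script" module, and Azure ML supports Python 2.7.7. I have heard that nltk in Python can do the job, but I am a beginner in Python. Any help would be appreciated.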

Answer 1

Take a look at the NLTK book, in particular the "Categorizing and Tagging Words" chapter.

A simple example, which uses the Penn Treebank tagset:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad.")) 

[('John', 'NNP'),
 ("'s", 'POS'),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'VBZ'),
 ("n't", 'RB'),
 ('all', 'DT'),
 ('that', 'DT'),
 ('bad', 'JJ'),
 ('.', '.')]

Then you can use

from collections import defaultdict
counts = defaultdict(int)
for (word, tag) in pos_tag(word_tokenize("John's big idea isn't all that bad.")):
    counts[tag] += 1

to get the frequencies:

defaultdict(<type 'int'>, {'JJ': 2, 'NN': 1, 'POS': 1, '.': 1, 'RB': 1, 'VBZ': 1, 'DT': 2, 'NNP': 1})
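If you need a fixed set of 36 columns, one per Penn Treebank tag, one option is to look each tag up in the counts and fill in 0 for tags that did not occur. Below is a minimal sketch of that idea. The names `PENN_TAGS` and `pos_feature_row` are made up for illustration and are not part of NLTK, and the tagged tokens are copied by hand from the `pos_tag` output shown above, so the snippet runs without NLTK installed. It uses only `collections.Counter`, which is available in Python 2.7.

```python
from collections import Counter

# The 36 word-level Penn Treebank tags. Punctuation tags such as '.' are
# deliberately left out, so they do not get a column.
PENN_TAGS = [
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "WDT", "WP", "WP$", "WRB",
]


def pos_feature_row(tagged_tokens):
    """Return a list of 36 counts, one per tag in PENN_TAGS, in that order.

    tagged_tokens is a list of (word, tag) pairs, e.g. the result of
    pos_tag(word_tokenize(sentence)).
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    # A Counter returns 0 for missing keys, so absent tags become 0;
    # tags outside PENN_TAGS (e.g. '.') are simply never looked up.
    return [counts[tag] for tag in PENN_TAGS]


# Hypothetical input: the tagged sentence from the example above.
tagged = [("John", "NNP"), ("'s", "POS"), ("big", "JJ"), ("idea", "NN"),
          ("is", "VBZ"), ("n't", "RB"), ("all", "DT"), ("that", "DT"),
          ("bad", "JJ"), (".", ".")]

row = pos_feature_row(tagged)          # 36 integers
features = dict(zip(PENN_TAGS, row))   # tag -> count, e.g. features["JJ"]
```

Calling `pos_feature_row` once per sentence gives rows of equal length, which should make it straightforward to collect them into a table with `PENN_TAGS` as the column names.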
