Extract list of Persons and Organizations using Stanford NER Tagger in NLTK
I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK.
When I run:
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)
The output is:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
What I want is to extract from this list all persons and organizations in this form:
Rami Eid
Stony Brook University
I tried to loop through the list of tuples:
for x, y in r:
    if y == 'ORGANIZATION':
        print(x)
But this code only prints each entity one word per line:
Stony
Brook
University
With real data there can be more than one organization and more than one person in a single sentence. How can I put boundaries between the different entities?
Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:
Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)
You have the following choices:
1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be joined together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer; see below for a simpler implementation.
2. Use nltk.ne_recognize(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger; a minimal sketch follows this list.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.
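A minimal sketch of option 2, assuming a current NLTK release where the built-in chunking wrapper is exposed as nltk.ne_chunk (it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources downloaded via nltk.download() first):

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

sentence = 'Rami Eid is studying at Stony Brook University in NY'

# ne_chunk works on POS-tagged tokens and returns an nltk.Tree
# whose subtrees are the chunked named entities.
tree = ne_chunk(pos_tag(word_tokenize(sentence)))

for subtree in tree:
    if isinstance(subtree, Tree):  # each NE chunk is a labelled subtree (PERSON, ORGANIZATION, GPE, ...)
        print(subtree.label(), " ".join(token for token, pos in subtree.leaves()))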
Edit: If you only want to pull out runs of contiguous named entities (option 1 above), you should use itertools.groupby:
from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))
If netagged_words is the list of (word, type) tuples from your question, this produces:
PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY
Note again that this method will combine two named entities of the same type if they occur right next to each other; e.g. New York, Boston [and] Baltimore is about three cities, not one. If you want the entities collected into lists rather than printed, a small variation is sketched below.
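A minimal sketch of that variation, using the same groupby idea (the helper name collect_entities is my own, not from the answer):

from collections import defaultdict
from itertools import groupby

def collect_entities(netagged_words):
    # Group adjacent tokens that share the same tag, and collect the
    # resulting chunks in a dict keyed by entity type.
    entities = defaultdict(list)
    for tag, chunk in groupby(netagged_words, lambda pair: pair[1]):
        if tag != "O":
            entities[tag].append(" ".join(word for word, _ in chunk))
    return entities

# With the tagged sentence from the question:
# entities['PERSON']       -> ['Rami Eid']
# entities['ORGANIZATION'] -> ['Stony Brook University']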
IOB/BIO means Inside, Outside, Beginning (IOB), or sometimes aka Beginning, Inside, Outside (BIO).
The Stanford NE tagger returns IOB/BIO-style tags, e.g.
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
('Rami', 'PERSON'), ('Eid', 'PERSON')
are tagged as PERSON, where "Rami" is the Beginning of the NE chunk and "Eid" is the Inside. Then you see that any non-NE token is tagged with "O".
The idea of extracting continuous NE chunks is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:
def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print(named_entities_str)
print(named_entities_str_tag)
[out]:
[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]
['Rami Eid', 'Stony Brook University', 'NY']
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
But note the limitation: if two NEs are contiguous, this might be wrong. However, I still can't think of an example where two NEs are contiguous without any "O" between them.
As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:
from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree
def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O":  # O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O":  # Begin NE
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag:  # Inside NE
            bio_tagged_sent.append((token, "I-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag:  # Adjacent NE
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
    return bio_tagged_sent
def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]
    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
                 ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
                 ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
                 ('in', 'O'), ('NY', 'LOCATION')]
ne_tree = stanfordNE2tree(ne_tagged_sent)
print(ne_tree)
[out]:
(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))
Then:
ne_in_sent = []
for subtree in ne_tree:
    if type(subtree) == Tree:  # If subtree is a noun chunk, i.e. NE != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)
[out]:
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
This doesn't print exactly what the topic author asked for, but maybe it is of some help:
listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

def parser(n, string):
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0, 'PERSON')
lname = parser(1, 'PERSON')
org1 = parser(5, 'ORGANIZATION')
org2 = parser(6, 'ORGANIZATION')
org3 = parser(7, 'ORGANIZATION')

print(name, lname)
print(org1, org2, org3)
The output will be something like this:
Rami Eid
Stony Brook University
Use the pycorenlp wrapper from Python and then use 'entitymentions' as a key to get a continuous chunk of person or organization as a single string.
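A minimal sketch of that approach, assuming the pycorenlp package is installed and a CoreNLP server is already running at http://localhost:9000 (the annotator names in the properties dict are the standard CoreNLP ones):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')  # assumes a running CoreNLP server

text = 'Rami Eid is studying at Stony Brook University in NY'
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,lemma,ner',
    'outputFormat': 'json'
})

for sentence in output['sentences']:
    for mention in sentence['entitymentions']:
        # Each entity mention already spans the whole chunk, e.g. "Stony Brook University".
        print(mention['ner'], mention['text'])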
Try the "enumerate" method.
When you apply NER to the list of words, once the (word, type) tuples are created, enumerate this list with enumerate(list). That assigns an index to every tuple in the list.
So later, when you extract PERSON/ORGANISATION/LOCATION entries from the list, they will have an index attached.
1 Hussein
2 Obama
3 II
6 James
7 Naismith
21 Naismith
19 Tony
20 Hinkle
0 Frank
1 Mahan
14 Naismith
0 Naismith
0 Mahan
0 Mahan
0 Naismith
Now, on the basis of the consecutive indices, single names can be filtered out (a sketch of this grouping step follows below):
Hussein Obama II,
James Naismith,
Tony Hinkle,
Frank Mahan
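A minimal sketch of that filtering step (group_consecutive is a hypothetical helper name, not part of NLTK or the answer above):

def group_consecutive(tagged_words, wanted_tag):
    """Collect words whose tag is wanted_tag and whose indices are consecutive."""
    # Keep (index, word) pairs for the tag we care about.
    indexed = [(i, word) for i, (word, tag) in enumerate(tagged_words) if tag == wanted_tag]

    groups, current = [], []
    prev_index = None
    for i, word in indexed:
        if prev_index is not None and i != prev_index + 1:
            groups.append(" ".join(current))  # index gap -> start a new name
            current = []
        current.append(word)
        prev_index = i
    if current:
        groups.append(" ".join(current))
    return groups

tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
print(group_consecutive(tagged, 'PERSON'))        # ['Rami Eid']
print(group_consecutive(tagged, 'ORGANIZATION'))  # ['Stony Brook University']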
Warning:
Even if you get the model "all.3class.distsim.crf.ser.gz", please do not use it, because:
1st reason: For this model the Stanford NLP people have openly apologized for its bad accuracy.
2nd reason: It has bad accuracy because it is case sensitive.
Solution: Use the model called "english.all.3class.caseless.distsim.crf.ser.gz" instead.
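A minimal sketch of loading the caseless model through NLTK. Newer NLTK versions expose the tagger as StanfordNERTagger rather than NERTagger, and the paths below are placeholders for wherever you unpacked the Stanford NER jar and its caseless models:

from nltk.tag import StanfordNERTagger

st = StanfordNERTagger(
    '/usr/share/stanford-ner/classifiers/english.all.3class.caseless.distsim.crf.ser.gz',
    '/usr/share/stanford-ner/stanford-ner.jar')

# The caseless model is more robust when capitalization in the input is unreliable.
tokens = 'rami eid is studying at stony brook university in ny'.split()
print(st.tag(tokens))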