Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

Question

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar') 
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)

the output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract all persons and organizations from this list in the following form:

Rami Eid
Stony Brook University

I tried to loop through the list of tuples:

for x, y in r:
    if y == 'ORGANIZATION':
        print(x)

But this code prints each token of the entity on its own line:

Stony
Brook
University

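With real data there can be more than one organization or person in a single sentence. How can I mark the boundaries between different entities?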

Answer 1

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

You have the following options:

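1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That is very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.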

Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:

from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x:x[1]):
    if tag != "O":
        print("%-12s"%tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY

Note again that if two named entities of the same type occur right next to each other, this approach will merge them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
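To get the separate lists of persons and organizations that the question asks for, the same groupby idea can be wrapped in a small helper. This is only a sketch: the name entities_by_type and its wanted parameter are made up for illustration, and it inherits the merging caveat just mentioned.

```python
from collections import defaultdict
from itertools import groupby


def entities_by_type(tagged_words, wanted=("PERSON", "ORGANIZATION")):
    """Group runs of identically tagged words and bucket them by tag.

    Hypothetical helper, not part of NLTK or the Stanford tools.
    """
    found = defaultdict(list)
    for tag, chunk in groupby(tagged_words, key=lambda pair: pair[1]):
        if tag in wanted:
            found[tag].append(" ".join(word for word, _ in chunk))
    return dict(found)


netagged_words = [
    ('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'),
]

result = entities_by_type(netagged_words)
print(result)
```

Passing wanted=("LOCATION",) would return the locations instead; adjacent tokens with the same tag still come back as a single entity.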

Answer 2

IOB/BIO means Inside, Outside, Beginning (IOB), sometimes also known as Beginning, Inside, Outside (BIO).

The Stanford NE tagger returns tags that resemble IOB/BIO style, although without the B-/I- prefixes, e.g.

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

('Rami', 'PERSON'), ('Eid', 'PERSON') are both tagged PERSON; "Rami" is the beginning of an NE chunk and "Eid" is the inside. You can also see that any non-NE token is tagged "O".

The idea for extracting continuous NE chunks is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)
print()

[out]:

[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]

['Rami Eid', 'Stony Brook University', 'NY']

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

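But do note the limitation: if two NEs are contiguous, the result might be wrong. That said, I still can't think of any example where two NEs are contiguous without any "O" between them.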
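One case the loop above does not separate is two entities of different types that touch, for example a PERSON token directly followed by an ORGANIZATION token: both end up in the same chunk. A minimal sketch of a variant that also starts a new chunk whenever the tag changes could look like this. The name get_chunks_split_on_tag_change and the sample tagging are hypothetical, and two adjacent entities of the same type are still merged.

```python
def get_chunks_split_on_tag_change(tagged_sent):
    """Like get_continuous_chunks, but a change of tag also ends a chunk."""
    chunks = []
    current_chunk = []
    for token, tag in tagged_sent:
        # Close the open chunk on "O" or when the entity type switches.
        if current_chunk and (tag == "O" or tag != current_chunk[-1][1]):
            chunks.append(current_chunk)
            current_chunk = []
        if tag != "O":
            current_chunk.append((token, tag))
    if current_chunk:  # flush the last chunk, if any
        chunks.append(current_chunk)
    return chunks


# Made-up tagging for illustration: PERSON directly followed by ORGANIZATION.
example = [('Obama', 'PERSON'), ('White', 'ORGANIZATION'),
           ('House', 'ORGANIZATION'), ('today', 'O')]
split_chunks = get_chunks_split_on_tag_change(example)
print(split_chunks)
```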

As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:

from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

ne_tree = stanfordNE2tree(ne_tagged_sent)

print(ne_tree)

[out]:

  (S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))

Then:

ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree): # If subtree is an NE chunk, i.e. NE != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

[out]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

Answer 3

This doesn't print exactly what the topic author asked for, but maybe it can be of some help:

listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]


def parser(n, string):
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0,'PERSON')
lname = parser(1,'PERSON')
org1 = parser(5,'ORGANIZATION')
org2 = parser(6,'ORGANIZATION')
org3 = parser(7,'ORGANIZATION')


print(name, lname)
print(org1, org2, org3)

The output would be something like this:

Rami Eid
Stony Brook University

Answer 4

Use the pycorenlp wrapper from Python, then use 'entitymentions' as the key to get each continuous person or organization chunk as a single string.
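As a rough sketch of what that looks like: the helper below only walks the JSON that CoreNLP returns, and it is shown on a hand-written dictionary shaped like that JSON so that it runs without a server. The function name, the sample data, and the server URL in the comment are assumptions; depending on the CoreNLP version, the entitymentions annotator may also need to be listed explicitly.

```python
def mentions_from_annotation(annotation, wanted=("PERSON", "ORGANIZATION")):
    """Collect (text, type) pairs from the 'entitymentions' of each sentence."""
    mentions = []
    for sentence in annotation.get("sentences", []):
        for mention in sentence.get("entitymentions", []):
            if mention["ner"] in wanted:
                mentions.append((mention["text"], mention["ner"]))
    return mentions


# With a CoreNLP server running (assumed here at http://localhost:9000),
# the annotation would come from something like:
#   from pycorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP('http://localhost:9000')
#   annotation = nlp.annotate(text, properties={
#       'annotators': 'tokenize,ssplit,pos,lemma,ner',
#       'outputFormat': 'json'})
# Hand-written stand-in with the same shape, for illustration only:
annotation = {"sentences": [{"entitymentions": [
    {"text": "Rami Eid", "ner": "PERSON"},
    {"text": "Stony Brook University", "ner": "ORGANIZATION"},
    {"text": "NY", "ner": "LOCATION"},
]}]}

mentions = mentions_from_annotation(annotation)
print(mentions)
```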

Answer 5

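Try using the "enumerate" method.

When you apply NER to the list of words, once the (word, type) tuples are created, enumerate this list using enumerate(list). This assigns an index to every tuple in the list.

Later, when you extract PERSON/ORGANISATION/LOCATION from the list, each token will have an index attached to it.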

1   Hussein
2   Obama
3   II
6   James
7   Naismith
21   Naismith
19   Tony
20   Hinkle
0   Frank
1   Mahan
14   Naismith
0   Naismith
0   Mahan
0   Mahan
0   Naismith

Now, on the basis of consecutive indices, the individual names can be filtered out:

Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan
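A minimal sketch of that idea, assuming the tagger output is a list of (word, type) tuples: keep the index of every token with the wanted tag, then join tokens whose indices are consecutive. The function name and the sample sentence are invented for illustration.

```python
def names_from_consecutive_indices(tagged_words, wanted="PERSON"):
    """Join tokens tagged `wanted` whose positions are consecutive."""
    indexed = [(i, word) for i, (word, tag) in enumerate(tagged_words)
               if tag == wanted]
    names = []
    previous_index = None
    for i, word in indexed:
        if previous_index is not None and i == previous_index + 1:
            names[-1] += " " + word   # continues the previous name
        else:
            names.append(word)        # gap in the indices: a new name starts
        previous_index = i
    return names


# Invented tagging for illustration.
tagged = [('James', 'PERSON'), ('Naismith', 'PERSON'), ('met', 'O'),
          ('Frank', 'PERSON'), ('Mahan', 'PERSON'), ('in', 'O'),
          ('Springfield', 'LOCATION')]
names = names_from_consecutive_indices(tagged)
print(names)
```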

Answer 6

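Warning: even if you have the model "all.3class.distsim.crf.ser.gz", please don't use it, because:

1st reason:

The Stanford NLP people have openly apologized for this model's bad accuracy.

2nd reason:

It has bad accuracy because it is case sensitive.

Solution

Use the model called "english.all.3class.caseless.distsim.crf.ser.gz".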
