Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar') 
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r) 

the output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract from this list all persons and organizations in the following form:

Rami Eid
Stony Brook University

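I tried to loop through the list of tuples:

for x, y in r:
    if y == 'ORGANIZATION':
        print(x)

But this code only prints each entity one word per line:

Stony
Brook
University

With real data there can be more than one organization or person in a single sentence. How can I put the boundaries between different entities?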

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

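You have the following options:

  1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.

  2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)

  3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.

  4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.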

Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:

from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x:x[1]):
    if tag != "O":
        print("%-12s"%tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY

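Again, note that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.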
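To get only the persons and organizations the question asks for, the same groupby idea can be wrapped in a small function. This is just a sketch: the name extract_entities and its wanted parameter are my own, not part of NLTK or the Stanford tools, and it carries the same merging caveat as above.

```python
from itertools import groupby


def extract_entities(tagged_words, wanted=("PERSON", "ORGANIZATION")):
    """Join runs of identically tagged words, keeping only the wanted types."""
    entities = []
    # groupby yields one group per run of consecutive items with the same tag.
    for tag, chunk in groupby(tagged_words, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((" ".join(word for word, _ in chunk), tag))
    return entities


netagged_words = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
                  ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
                  ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
                  ('in', 'O'), ('NY', 'LOCATION')]

for name, tag in extract_entities(netagged_words):
    print(name)
```

With the tagged sentence from the question this prints "Rami Eid" and "Stony Brook University", each on its own line.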

IOB/BIO stands for Inside, Outside, Beginning (IOB), sometimes also written Beginning, Inside, Outside (BIO).

The Stanford NE tagger returns IOB/BIO-like tags (plain labels without B-/I- prefixes, plus "O"), e.g.

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

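('Rami', 'PERSON'), ('Eid', 'PERSON') are both tagged PERSON: "Rami" is the Beginning of an NE chunk and "Eid" is Inside it. Any non-NE token is tagged "O".

The idea of extracting continuous NE chunks is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this: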

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)
print()

[out]:

[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]

['Rami Eid', 'Stony Brook University', 'NY']

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

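Note the limitation, though: if two NEs are adjacent, the result might be wrong. Still, I can't think of any example where two NEs are adjacent without any "O" between them.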


As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:

from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

ne_tree = stanfordNE2tree(ne_tagged_sent)

print(ne_tree)

[out]:

(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))

Then:

ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree):  # subtree is an NE chunk, i.e. NE != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

[out]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

This doesn't print exactly what the topic author asked for, but maybe it can be of some help:

listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]


def parser(n, string):
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0,'PERSON')
lname = parser(1,'PERSON')
org1 = parser(5,'ORGANIZATION')
org2 = parser(6,'ORGANIZATION')
org3 = parser(7,'ORGANIZATION')


print(name, lname)
print(org1, org2, org3)

The output would be something like this:

Rami Eid
Stony Brook University

Use the pycorenlp wrapper for Python, then use 'entitymentions' as the key to get each continuous person or organization chunk as a single string.
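A rough sketch of reading that key, assuming the CoreNLP server's JSON output has a "sentences" list whose items carry an "entitymentions" list with "text" and "ner" fields. The helper name entity_mentions is made up, and the sample dictionary is hand-written to stand in for a real server response, so check the actual output of your CoreNLP version (older versions may need the entitymentions annotator listed explicitly).

```python
def entity_mentions(annotation, wanted=("PERSON", "ORGANIZATION")):
    """Collect (text, label) pairs from a parsed CoreNLP JSON response."""
    mentions = []
    for sentence in annotation.get("sentences", []):
        for mention in sentence.get("entitymentions", []):
            if mention.get("ner") in wanted:
                mentions.append((mention["text"], mention["ner"]))
    return mentions


# With a CoreNLP server running locally, the annotation would come from
# something like this (not executed here):
#   from pycorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP('http://localhost:9000')
#   annotation = nlp.annotate(text, properties={'annotators': 'ner',
#                                               'outputFormat': 'json'})

# Hand-written stand-in for a response, for illustration only.
sample_annotation = {
    "sentences": [
        {"entitymentions": [
            {"text": "Rami Eid", "ner": "PERSON"},
            {"text": "Stony Brook University", "ner": "ORGANIZATION"},
            {"text": "NY", "ner": "LOCATION"},
        ]}
    ]
}

print(entity_mentions(sample_annotation))
```

The advantage over the token-level approaches above is that the chunking is done by CoreNLP itself, so no grouping code is needed on the Python side.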

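Try using the "enumerate" method.

When you apply NER to the list of words, once the (word, type) tuples have been created, enumerate this list using enumerate(list). This assigns an index to every tuple in the list.

So later, when you extract PERSON/ORGANISATION/LOCATION from the list, each of them will have an index attached.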

1   Hussein
2   Obama
3   II
6   James
7   Naismith
21   Naismith
19   Tony
20   Hinkle
0   Frank
1   Mahan
14   Naismith
0   Naismith
0   Mahan
0   Mahan
0   Naismith

Now the individual names can be filtered out on the basis of consecutive indices:

Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan
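The index-based filtering could look roughly like this. It is only a sketch: the function name group_by_consecutive_index and the sample sentence are invented for illustration, and it handles one sentence and one entity type at a time.

```python
def group_by_consecutive_index(tagged_words, wanted="PERSON"):
    """Join words with the wanted tag whose positions are consecutive."""
    # Keep (index, word) only for tokens carrying the wanted tag.
    indexed = [(i, word) for i, (word, tag) in enumerate(tagged_words)
               if tag == wanted]
    names, current, prev = [], [], None
    for i, word in indexed:
        # A gap in the indices means a new entity starts here.
        if prev is not None and i != prev + 1:
            names.append(" ".join(current))
            current = []
        current.append(word)
        prev = i
    if current:
        names.append(" ".join(current))
    return names


# Made-up tagged sentence, for illustration only.
tagged = [('Frank', 'PERSON'), ('Mahan', 'PERSON'), ('met', 'O'),
          ('James', 'PERSON'), ('Naismith', 'PERSON'), ('in', 'O'),
          ('Springfield', 'LOCATION')]

print(group_by_consecutive_index(tagged))
```

For this sample it prints ['Frank Mahan', 'James Naismith']. Like the other run-based approaches, it merges two persons that appear directly next to each other.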

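Warning: Even if you have the model "all.3class.distsim.crf.ser.gz", please don't use it, because:

    First reason:

The Stanford NLP people have publicly apologized for this model's poor accuracy.

    Second reason:

It is case-sensitive, which hurts its accuracy.

    Solution

Use the model named "english.all.3class.caseless.distsim.crf.ser.gz".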