Grouping NLTK entities

I have the following code:

import nltk

# A multi-line string literal needs triple quotes in Python.
page = '''
EDUCATION
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
'''

for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity subtrees have a label; plain tokens are (word, tag) tuples.
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())

Returns:

EDUCATION   ORGANIZATION
UniversityWon   ORGANIZATION
Sixth   PERSON
FormMathematics   ORGANIZATION
Economics   PERSON
FrenchUK   GPE
London   GPE

I would like to group these by entity label into some data structure, perhaps a list: ORGANIZATION=[EDUCATION,UniversityWon,FormMathematics] PERSON=[Sixth,Economics] GPE=[FrenchUK,London]

Or perhaps a dictionary with the keys ORGANIZATION, PERSON and GPE, with the associated values as shown above.

A dictionary makes more sense, perhaps something like this:

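from collections import defaultdict

import nltk

# Maps each entity label to the list of entity strings found with that label.
# `page` is the string defined in the question.
entities = defaultdict(list)

for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            entities[chunk.label()].append(''.join(c[0] for c in chunk))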
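
To try the grouping logic on its own, without downloading any NLTK models, here is a minimal sketch. It assumes a hypothetical stand-in class, `FakeChunk`, that mimics only the parts of `nltk.Tree` used above (iteration over `(word, tag)` pairs and a `label()` method). The helper name `group_entities` and the sample data are also made up for illustration and are not real `ne_chunk` output. It joins words with a space rather than an empty string, which is a small change from the code above so that multi-word entities stay readable.

```python
from collections import defaultdict


class FakeChunk(list):
    """Hypothetical stand-in for nltk.Tree: a list of (word, tag) pairs with a label."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def group_entities(chunks):
    """Group named-entity chunks by label; plain (word, tag) tuples are skipped."""
    entities = defaultdict(list)
    for chunk in chunks:
        if hasattr(chunk, 'label'):
            # Join with a space so multi-word entities stay readable.
            entities[chunk.label()].append(' '.join(word for word, _tag in chunk))
    return dict(entities)


# Hand-written sample, not real ne_chunk output.
sample = [
    FakeChunk('GPE', [('UK', 'NNP')]),
    (',', ','),
    FakeChunk('GPE', [('London', 'NNP')]),
    FakeChunk('ORGANIZATION', [('Sixth', 'NNP'), ('Form', 'NNP')]),
]

grouped = group_entities(sample)
print(grouped)
```

With real input, the same function could presumably be called once per sentence on the result of `nltk.ne_chunk(...)`, merging the per-sentence dictionaries afterwards.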