Grouping NLTK entities
I have the following code:
import nltk

# Triple quotes are needed for a string that spans several lines.
page = """
EDUCATION
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
"""

for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are Tree objects with a label; plain tokens are tuples.
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())
Returns:
EDUCATION ORGANIZATION
UniversityWon ORGANIZATION
Sixth PERSON
FormMathematics ORGANIZATION
Economics PERSON
FrenchUK GPE
London GPE
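I want to group these by entity label into some data structure, perhaps lists: ORGANIZATION=[EDUCATION,UniversityWon,FormMathematics] PERSON=[Sixth,Economics] GPE=[FrenchUK,London]
Or perhaps a dictionary with the keys ORGANIZATION, PERSON, and GPE, where the associated values are the lists shown above.
A dictionary makes more sense, perhaps something like this.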
from collections import defaultdict

# Map each entity label to the list of entity strings found with that label.
entities = defaultdict(list)
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            entities[chunk.label()].append(''.join(c[0] for c in chunk))
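To see the grouping step on its own, without downloading the NLTK models, here is a minimal sketch that feeds the (text, label) pairs from the output above into the same defaultdict pattern. The names `labelled` and `group_by_label` are made up for this illustration and are not part of NLTK.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output shown in the question.
# In the real pipeline these would come from nltk.ne_chunk.
labelled = [
    ("EDUCATION", "ORGANIZATION"),
    ("UniversityWon", "ORGANIZATION"),
    ("Sixth", "PERSON"),
    ("FormMathematics", "ORGANIZATION"),
    ("Economics", "PERSON"),
    ("FrenchUK", "GPE"),
    ("London", "GPE"),
]


def group_by_label(pairs):
    """Group entity strings under their label, preserving input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        # A missing key is created as an empty list on first access.
        grouped[label].append(text)
    return dict(grouped)


entities = group_by_label(labelled)
for label, names in entities.items():
    print(label, names)
```

Converting back to a plain dict at the end is optional; it just means a later lookup of an unknown label raises KeyError rather than silently creating an empty list. Also note that ''.join glues the tokens of a multi-word entity together (hence "UniversityWon"); ' '.join would keep the words separated if that reads better.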