How to Capitalize Locations in a List Python
I'm using the NLTK library in Python to break each word into tagged elements (i.e. ('London', 'NNP')). However, I can't work out how to take this list and capitalise the locations if they are lower-case. This matters because london is no longer tagged 'NNP', and some other places even end up tagged as verbs. If anyone knows how to do this efficiently, that would be great!
Here is my code:
# returns nature of question with appropriate response text
def chunk_target(self, text, extract_targets):
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammer in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []
            # This is where I'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()
                print(new)
            print(tagged)
            chunkGram = chunk_grammer
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
    if stack != []:
        return stack[0]
    return None
What you are looking for is Named Entity Recognition (NER). NLTK does support a named-entity function, ne_chunk, which can be used for this purpose. Let me demonstrate:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()
locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract the named-entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)
print(locations)
This outputs (with my comments added):
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
  In/IN
  the/DT
  wake/NN
  of/IN
  a/DT
  string/NN
  of/IN
  abuses/NNS
  by/IN
  (GPE New/NNP York/NNP)
  police/NN
  officers/NNS
  in/IN
  the/DT
  1990s/CD
  ,/,
  (PERSON Loretta/NNP E./NNP Lynch/NNP)
  ,/,
  the/DT
  top/JJ
  federal/JJ
  prosecutor/NN
  in/IN
  (GPE Brooklyn/NNP)
  ,/,
  spoke/VBD
  forcefully/RB
  about/IN
  the/DT
  pain/NN
  of/IN
  a/DT
  broken/JJ
  trust/NN
  that/IN
  African-Americans/NNP
  felt/VBD
  and/CC
  said/VBD
  the/DT
  responsibility/NN
  for/IN
  repairing/VBG
  generations/NNS
  of/IN
  miscommunication/NN
  and/CC
  mistrust/NN
  fell/VBD
  to/TO
  law/NN
  enforcement/NN
  ./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
It should be noted, however, that ne_chunk's performance appears to degrade significantly if we strip all capitalisation from the sentence.
We can do something similar with spaCy:
import spacy
import en_core_web_sm
from pprint import pprint
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
Output:
[('New York', 'GPE'),
('the 1990s', 'DATE'),
('Loretta E. Lynch', 'PERSON'),
('Brooklyn', 'GPE'),
('African-Americans', 'NORP')]
['New York', 'Brooklyn']
This output (for GPE) is identical to NLTK's, but the reason I mention spaCy is that, unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, the output becomes:
[('new york', 'GPE'),
('the 1990s', 'DATE'),
('loretta e. lynch', 'PERSON'),
('brooklyn', 'GPE'),
('african-americans', 'NORP')]
['new york', 'brooklyn']
This lets you title-case those words within an otherwise lower-case sentence.
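As a minimal sketch of that last step: once the NER pass has given you a list of lower-case entity strings (e.g. `['new york', 'brooklyn']`), you can title-case them in place with a case-insensitive regex substitution. The helper name `titlecase_entities` is mine, not from any library:

```python
import re

def titlecase_entities(sentence, entities):
    """Title-case each detected entity span inside a lower-case sentence.

    `entities` is assumed to be the list of GPE strings produced by the
    NER step above, e.g. ['new york', 'brooklyn'].
    """
    for ent in entities:
        # \b anchors keep us from touching substrings of longer words
        pattern = re.compile(r"\b" + re.escape(ent) + r"\b", re.IGNORECASE)
        sentence = pattern.sub(ent.title(), sentence)
    return sentence

print(titlecase_entities(
    "trains from london to new york are rare.",
    ["london", "new york"],
))
# -> trains from London to New York are rare.
```

After this rewrite, you can re-run pos_tag on the repaired sentence and the locations should come back as 'NNP' again.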