How to Capitalize Locations in a List in Python

I'm using the NLTK library in Python to break each sentence into tagged tokens (e.g. ('London', 'NNP')). However, I can't figure out how to take this list and capitalize the locations in it when they are lower-case. This matters because 'london' is no longer tagged as 'NNP', and some other place names even get tagged as verbs. If anyone knows how to do this effectively, that would be great!
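For example, a quick check along these lines shows the difference (a minimal sketch; the exact tags depend on NLTK's tagger model):

from nltk import word_tokenize, pos_tag

# The same sentence with the place name capitalized and then lower-cased
print(pos_tag(word_tokenize("Next train to London please")))
print(pos_tag(word_tokenize("next train to london please")))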

Here is my code:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

# returns the nature of the question with an appropriate response text
def chunk_target(self, text, extract_targets):
    
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammar in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []

            # This is where I'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()

            print(new)              
            print(tagged)

            chunkGram = chunk_grammar
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
        if stack != []:
            return stack[0]
    return None

What you are looking for is Named Entity Recognition (NER). NLTK does support a named-entity function, ne_chunk, which can be used for this purpose. Let me demonstrate:

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()

locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)

print(locations)

This outputs (my comments added):

# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
  In/IN
  the/DT
  wake/NN
  of/IN
  a/DT
  string/NN
  of/IN
  abuses/NNS
  by/IN
  (GPE New/NNP York/NNP)
  police/NN
  officers/NNS
  in/IN
  the/DT
  1990s/CD
  ,/,
  (PERSON Loretta/NNP E./NNP Lynch/NNP)
  ,/,
  the/DT
  top/JJ
  federal/JJ
  prosecutor/NN
  in/IN
  (GPE Brooklyn/NNP)
  ,/,
  spoke/VBD
  forcefully/RB
  about/IN
  the/DT
  pain/NN
  of/IN
  a/DT
  broken/JJ
  trust/NN
  that/IN
  African-Americans/NNP
  felt/VBD
  and/CC
  said/VBD
  the/DT
  responsibility/NN
  for/IN
  repairing/VBG
  generations/NNS
  of/IN
  miscommunication/NN
  and/CC
  mistrust/NN
  fell/VBD
  to/TO
  law/NN
  enforcement/NN
  ./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']

However, it should be noted that ne_chunk's performance seems to drop significantly if we strip all capitalization from the sentence.
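You can see this for yourself by lower-casing the test sentence and running the identical pipeline; in my experience few or no entities come back (a minimal sketch re-using the imports and `sentence` from above; the exact result depends on your installed NLTK models):

# Lower-case the same sentence and run the same tokenize -> tag -> chunk pipeline
lower_tree = ne_chunk(pos_tag(word_tokenize(sentence.lower())))
# Print whatever named entities (if any) are still found
for named_entity in lower_tree.subtrees(lambda t: t.height() == 2):
    print(named_entity.label(), " ".join(word for word, tag in named_entity.leaves()))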


We can do something similar with spaCy:

import spacy
import en_core_web_sm
from pprint import pprint

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()

doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])

Output:

[('New York', 'GPE'),
 ('the 1990s', 'DATE'),
 ('Loretta E. Lynch', 'PERSON'),
 ('Brooklyn', 'GPE'),
 ('African-Americans', 'NORP')]
['New York', 'Brooklyn']

This output (for GPE) is the same as NLTK's, but the reason I mention spaCy is that, unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, the output becomes:

[('new york', 'GPE'),
 ('the 1990s', 'DATE'),
 ('loretta e. lynch', 'PERSON'),
 ('brooklyn', 'GPE'),
 ('african-americans', 'NORP')]
['new york', 'brooklyn']

This lets you title-case those words back within an otherwise lower-case sentence.
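A minimal sketch of how that might look, re-using `nlp` and the test `sentence` from the spaCy example above (the plain str.replace and the `restored` variable name are just illustrative assumptions):

lowered = sentence.lower()
restored = lowered
for ent in nlp(lowered).ents:
    if ent.label_ == "GPE":
        # Swap each lower-case GPE span for its title-cased form
        restored = restored.replace(ent.text, ent.text.title())
print(restored)

If the same span can appear more than once or overlap other text, rebuilding the string from the entity character offsets (ent.start_char / ent.end_char) would be more robust than str.replace.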