Find successively connected nouns or pronouns in string

I want to find nouns in a text that either stand alone or are successively connected. I put the code below together, but it is neither efficient nor pythonic. Does anyone have a more pythonic way of finding these nouns using spaCy?

The code below builds a dictionary for every token and then walks through them, looking for stand-alone or connected PROPNs and NOUNs, until the for loop runs out of range. It returns a list of the collected items.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed setup: the snippet relies on an nlp defined elsewhere

def extract_unnamed_ents(doc):
  """Takes a string and returns a list of all successively connected nouns or pronouns"""
  nlp_doc = nlp(doc)
  token_list = []
  for token in nlp_doc:
    token_dict = {}
    token_dict['lemma'] = token.lemma_
    token_dict['pos'] = token.pos_
    token_dict['tag'] = token.tag_
    token_list.append(token_dict)
  ents = []
  k = 0
  for i in range(len(token_list)):
    try:
      if token_list[k]['pos'] == 'PROPN' or token_list[k]['pos'] == 'NOUN':
        ent = token_list[k]['lemma']

        if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
          ent = ent + ' ' + token_list[k+1]['lemma']
          k += 1
          if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
            ent = ent + ' ' + token_list[k+1]['lemma']
            k += 1
            if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
              ent = ent + ' ' + token_list[k+1]['lemma']
              k += 1
              if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                ent = ent + ' ' + token_list[k+1]['lemma']
                k += 1
        if ent not in ents:
          ents.append(ent)
    except IndexError:  # token_list[k+1] can run past the end of the list
      pass
    k += 1
  return ents

Test:

extract_unnamed_ents('Chancellor Angela Merkel and some of her ministers will discuss at a cabinet '
                     "retreat next week ways to avert driving bans in major cities after Germany's "
                     'top administrative court in February allowed local authorities to bar '
                     'heavily polluting diesel cars.')

Output:

['Chancellor Angela Merkel',
 'minister',
 'cabinet retreat',
 'week way',
 'ban',
 'city',
 'Germany',
 'court',
 'February',
 'authority',
 'diesel car']

spacy has a way of doing this, but I'm not sure it gives you exactly what you want:

import spacy

text = """Chancellor Angela Merkel and some of her ministers will discuss
at a cabinet retreat next week ways to avert driving bans in
major cities after Germany's top administrative court
in February allowed local authorities to bar heavily
polluting diesel cars.
""".replace('\n', ' ')

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([i.text for i in doc.noun_chunks])

which gives

['Chancellor Angela Merkel', 'her ministers', 'a cabinet retreat', 'ways', 'driving bans', 'major cities', "Germany's top administrative court", 'February', 'local authorities', 'heavily polluting diesel cars']

Using i.lemma_ instead of i.text here doesn't really give you what you want, though (I think this might be fixed by this recent PR).
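
To see why, here is a sketch (my illustration, not code from the answer); it assumes spaCy 2.x, where pronouns lemmatize to the placeholder -PRON-:

print([i.lemma_ for i in doc.noun_chunks])
# a chunk like 'her ministers' comes back as something like '-PRON- minister',
# which is why the chunk lemmas are not quite what the question asks for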

Since that's not quite what you're after, you can use itertools.groupby like this:

import itertools

out = []
# groupby clusters consecutive tokens that share the same POS tag
for pos, group in itertools.groupby(doc, key=lambda token: token.pos_):
    if pos not in ("PROPN", "NOUN"):
        continue
    out.append(' '.join(token.lemma_ for token in group))
print(out)

which gives

['Chancellor Angela Merkel', 'minister', 'cabinet retreat', 'week way', 'ban', 'city', 'Germany', 'court', 'February', 'authority', 'diesel car']

This should give you exactly the same output as your function (my output here is slightly different, but I believe that is down to differing spacy versions).
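
As an aside, the property that makes this work is that itertools.groupby only groups consecutive elements with equal keys, so separate noun runs never merge. A minimal illustration:

import itertools

letters = ['a', 'a', 'b', 'a']
# consecutive equal elements form one group; the trailing 'a' starts a new group
print([(key, list(group)) for key, group in itertools.groupby(letters)])
# [('a', ['a', 'a']), ('b', ['b']), ('a', ['a'])]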

If you're feeling really adventurous, you can use a list comprehension:

out = [' '.join(token.lemma_ for token in group)
       for pos, group in itertools.groupby(doc, key=lambda token: token.pos_)
       if pos in ("PROPN", "NOUN")]
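
For completeness, this is how it could be packaged as a drop-in replacement for the question's extract_unnamed_ents (a sketch; note that, unlike the original's 'if ent not in ents' check, it does not deduplicate repeated runs):

import itertools
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_unnamed_ents(doc):
    """Return each run of successively connected nouns or proper nouns as a lemma string."""
    # wrap the result in list(dict.fromkeys(...)) for order-preserving deduplication
    return [' '.join(token.lemma_ for token in group)
            for pos, group in itertools.groupby(nlp(doc), key=lambda token: token.pos_)
            if pos in ("PROPN", "NOUN")]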

Note that I see slightly different results with different spacy versions. The output above is from spacy-2.1.8.
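
If you're unsure which version you're running, spacy exposes it in the usual way:

import spacy
print(spacy.__version__)  # e.g. 2.1.8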