Tokenizing Named Entities in Spacy

Can anyone help?

I am trying to tokenize a document with spaCy so that named entities come out as single tokens. For example:

'New York is a city in the United States of America'

would be tokenized as:

['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']

Any tips on how to do this would be very welcome. I have looked at using span.merge() without success, but I am new to coding, so I may have missed something.

Thanks in advance.

Use the doc.retokenize context manager to merge entity spans into single tokens. Wrap this in a custom pipeline component and add the component to your language model.

import spacy

class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each predicted entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end], attrs={"LEMMA": str(doc[ent.start:ent.end])})
        return doc

nlp = spacy.load('en')  # spaCy v2 model shortcut; see the v3 update below for the newer API
retokenizer = EntityRetokenizeComponent(nlp)
nlp.add_pipe(retokenizer, name='merge_phrases', last=True)  # v2 API: pass the component instance

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

[tok for tok in doc]

#[German,
# Chancellor,
# Angela Merkel,
# and,
# US,
# President,
# Barack Obama,
# converse,
# in,
# the Oval Office,
# inside,
# the White House,
# in,
# Washington,
# ,,
# D.C.]
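
Applied to the sentence from the question, the same pipeline gives roughly the tokenization you asked for. This is just a quick sketch against the spaCy v2 setup above; the exact grouping depends on which spans the statistical NER model predicts (it may, for example, fold the leading "the" into the country entity):

# Run the question's sentence through the same pipeline (spaCy v2)
doc = nlp("New York is a city in the United States of America")
print([tok.text for tok in doc])
# e.g. ['New York', 'is', 'a', 'city', 'in', 'the United States of America']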

Update: spaCy v3 now ships the required functionality out of the box.

I also take back my expectation (described in my original answer below) that 'president-elect' would stay a single entity (though this may depend on your use case).

from pprint import pprint

import spacy
from spacy.language import Language


# Disabling components not needed (optional, but useful if run on a large dataset)
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "parser", "senter", "lemmatizer", "tagger", "attribute_ruler"])
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")
print('Pipeline components included: ', nlp.pipe_names)

example_text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect) converse in the Oval Office inside the White House in Washington, D.C."

print('Tokens: ')
pprint([
    token.text
    for token in nlp(example_text)
    # Including some of the conditions I find useful
    if not (
        token.is_punct
        or
        token.is_space
    )
])

The original answer continues below:


I think I have re-implemented the accepted answer so that it works with spaCy v3. It seems to produce at least the same output.

On a side note, I noticed that it broke a phrase such as "president-elect" up into 3 tokens. I am amending the original answer's example to include that, in the hopes that someone will comment with an amendment (or a new answer).


import spacy
from spacy.language import Language


class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end], attrs={"LEMMA": str(doc[ent.start:ent.end])})
        return doc

@Language.factory("entity_retokenizer_component")
def create_entity_retokenizer_component(nlp, name):
    return EntityRetokenizeComponent(nlp)

nlp = spacy.load("en_core_web_sm")  # You might want to load something else here, see docs
nlp.add_pipe("entity_retokenizer_component", name='merge_phrases', last=True)

text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect') converse in the Oval Office inside the White House in Washington, D.C."
tokened_text = [token.text for token in nlp(text)]

print(tokened_text)

# ['German',
#  'Chancellor',
#  'Angela Merkel',
#  'and',
#  'US',
#  'President',
#  'Barack Obama',
#  '(',
#  'Donald Trump',
#  'as',
#  'president',
#  '-',
#  'elect',
#  ')',
#  'converse',
#  'in',
#  'the Oval Office',
#  'inside',
#  'the White House',
#  'in',
#  'Washington',
#  ',',
#  'D.C.']
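
Not part of the original answers, but one way to handle the "president-elect" split noted above is an extra retokenize pass that merges word-hyphen-word sequences back into a single token. A minimal sketch, assuming spaCy v3 and the built-in merge_entities pipe; the "merge_hyphenated" component name is made up here, and a Matcher-based rule would handle longer chains such as "state-of-the-art" more cleanly:

import spacy
from spacy.language import Language


@Language.component("merge_hyphenated")
def merge_hyphenated(doc):
    # Merge <word>-<word> sequences with no spaces around the hyphen into one token
    with doc.retokenize() as retokenizer:
        i = 0
        while i < len(doc) - 2:
            if (doc[i + 1].text == "-"
                    and not doc[i].whitespace_
                    and not doc[i + 1].whitespace_):
                retokenizer.merge(doc[i:i + 3])
                i += 3  # skip past the merged span to avoid overlapping merges
            else:
                i += 1
    return doc


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")            # built-in replacement for the custom component above
nlp.add_pipe("merge_hyphenated", last=True)

print([tok.text for tok in nlp("Donald Trump as president-elect entered the White House")])
# Expected to keep 'Donald Trump', 'president-elect' and 'the White House' as single tokens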