Tokenizing Named Entities in spaCy
Can anyone help?
I'm trying to tokenize a document using spaCy so that named entities come out as single tokens. For example:
'New York is a city in the United States of America'
would be tokenized as:
['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
Any tips on how to do this would be very welcome. I've looked at using span.merge(), but without success; I'm new to coding, though, so I may be missing something.
Thanks in advance
Use the doc.retokenize context manager to merge entity spans into single tokens. Wrap this in a custom pipeline component, and add the component to your language model.
import spacy

class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end],
                                  attrs={"LEMMA": str(doc[ent.start:ent.end])})
        return doc

nlp = spacy.load('en')  # spaCy v2 shortcut; in v3, load 'en_core_web_sm' (see update below)

retokenizer = EntityRetokenizeComponent(nlp)
# spaCy v2 API: add_pipe accepts the component object directly
nlp.add_pipe(retokenizer, name='merge_phrases', last=True)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

[tok for tok in doc]
#[German,
# Chancellor,
# Angela Merkel,
# and,
# US,
# President,
# Barack Obama,
# converse,
# in,
# the Oval Office,
# inside,
# the White House,
# in,
# Washington,
# ,,
# D.C.]
Update: spaCy v3 now includes the required functionality out of the box.
I also retract my expectation (expressed in my original answer below) that 'president-elect' should remain a single entity (though that may depend on your use case).
from pprint import pprint
import spacy
from spacy.language import Language

# Disabling components not needed (optional, but useful if run on a large dataset).
# Note: merge_noun_chunks needs the dependency parse; with the parser disabled it
# returns the doc unchanged, so only merge_entities takes effect here.
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "parser", "senter", "lemmatizer", "tagger", "attribute_ruler"])
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")

print('Pipeline components included: ', nlp.pipe_names)

example_text = ("German Chancellor Angela Merkel and US President Barack Obama "
                "(Donald Trump as president-elect) converse in the Oval Office "
                "inside the White House in Washington, D.C.")

print('Tokens: ')
pprint([
    token.text
    for token in nlp(example_text)
    # Including some of the conditions I find useful
    if not (
        token.is_punct
        or
        token.is_space
    )
])
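For the question's original sentence, merge_entities on its own may already produce the desired tokenization. A minimal sketch, assuming spaCy v3 with en_core_web_sm installed; the output comment is indicative only, since exact entity boundaries depend on the model:

import spacy

# merge_entities is a built-in factory in spaCy v3
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")  # appended after ner, so entity spans get merged

doc = nlp("New York is a city in the United States of America")
print([token.text for token in doc])
# Something like:
# ['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
# (whether 'the' joins the entity depends on the model's entity boundaries)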
Back to the original answer:
I think I've reinterpreted the accepted answer so that it works with spaCy v3. At the very least, it appears to produce the same output.
On a side note, I noticed that it breaks up a phrase such as "president-elect" into three tokens. I am amending the original answer's example to include that, in the hope that someone will comment with an amendment (or a new answer); one possible fix is sketched after the example output below.
import spacy
from spacy.language import Language

class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end],
                                  attrs={"LEMMA": str(doc[ent.start:ent.end])})
        return doc

# In spaCy v3, custom components must be registered via a factory
@Language.factory("entity_retokenizer_component")
def create_entity_retokenizer_component(nlp, name):
    return EntityRetokenizeComponent(nlp)

nlp = spacy.load("en_core_web_sm")  # You might want to load something else here, see docs
nlp.add_pipe("entity_retokenizer_component", name='merge_phrases', last=True)

text = ("German Chancellor Angela Merkel and US President Barack Obama "
        "(Donald Trump as president-elect) converse in the Oval Office "
        "inside the White House in Washington, D.C.")

tokened_text = [token.text for token in nlp(text)]
print(tokened_text)
# ['German',
# 'Chancellor',
# 'Angela Merkel',
# 'and',
# 'US',
# 'President',
# 'Barack Obama',
# '(',
# 'Donald Trump',
# 'as',
# 'president',
# '-',
# 'elect',
# ')',
# 'converse',
# 'in',
# 'the Oval Office',
# 'inside',
# 'the White House',
# 'in',
# 'Washington',
# ',',
# 'D.C.']
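As one possible amendment for the "president-elect" split: hyphenated compounds are not entities, so the entity retokenizer will never touch them, but you can merge them yourself with a Matcher pattern and doc.retokenize. A minimal sketch, assuming spaCy v3; the HYPHENATED rule name and the alpha-hyphen-alpha pattern are my own illustration, not spaCy built-ins:

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")

# Hypothetical rule: an alphabetic token, a hyphen, an alphabetic token
matcher = Matcher(nlp.vocab)
matcher.add("HYPHENATED", [[{"IS_ALPHA": True}, {"ORTH": "-"}, {"IS_ALPHA": True}]])

doc = nlp("Donald Trump as president-elect converses in the Oval Office.")

# filter_spans drops overlapping matches so retokenize.merge doesn't conflict
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([token.text for token in doc])
# 'president-elect' now comes out as a single token

Whether you want this behavior depends on your use case, as noted in the update above.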