SpaCy: how do you add custom NER labels to a pre-trained model?
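I am new to spaCy and NLP. I am using spaCy v3.1 and Python 3.9.7 (64-bit).
My objective: use a pre-trained spaCy model (en_core_web_sm) and add a set of custom labels to the existing NER labels (GPE, PERSON, MONEY, etc.) so that the model can recognize both the default and the custom entities.
I have looked at the spaCy documentation, and what I need seems to be an EntityRecognizer, specifically a new pipe.
However, it is not clear to me at what point in my workflow I should add this new pipe, since in spaCy 3 the training happens in the CLI, and from the docs I cannot even tell where the pre-trained model is called.
Any tutorials or pointers you might have are highly appreciated.
This is what I think should be done, but I am not sure how: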
import spacy
from spacy import displacy
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from spacy.pipeline import EntityRecognizer

# Load model
nlp = spacy.load("en_core_web_sm")

# Register custom component and turn a simple function into a pipeline component
@Language.factory('new-ner')
def create_bespoke_ner(nlp, name):
    # Train the new pipeline with custom labels here??
    return LanguageDetector()

# Add custom pipe
custom = nlp.add_pipe("new-ner")
This is what my config file looks like so far. I suspect my new pipe needs to go next to "tok2vec" and "ner".
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100
For spaCy 3.2, I did it this way:
import random

import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacy.training import Example


def print_doc_entities(_doc: Doc):
    if _doc.ents:
        for _ent in _doc.ents:
            print(f" {_ent.text} {_ent.label_}")
    else:
        print(" NONE")


def customizing_pipeline_component(nlp: Language):
    # NOTE: Starting from Spacy 3.0, training via Python API was changed. For information see - https://spacy.io/usage/v3#migrating-training-python
    train_data = [
        ('We need to deliver it to Festy.', [(25, 30, 'DISTRICT')]),
        ('I like red oranges', [])
    ]

    # Result before training
    print("\nResult BEFORE training:")
    doc = nlp('I need a taxi to Festy.')
    print_doc_entities(doc)

    # Register the custom label on the pre-trained 'ner' component
    # (add_label does nothing if the label is already known)
    nlp.get_pipe('ner').add_label('DISTRICT')

    # Disable all pipe components except 'ner'
    # (disable_pipe replaces disable_pipes, which is deprecated in spaCy 3)
    disabled_pipes = []
    for pipe_name in nlp.pipe_names:
        if pipe_name != 'ner':
            nlp.disable_pipe(pipe_name)
            disabled_pipes.append(pipe_name)

    print(" Training ...")
    optimizer = nlp.create_optimizer()
    for _ in range(25):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            example = Example.from_dict(doc, {"entities": entity_offsets})
            nlp.update([example], sgd=optimizer)

    # Enable all previously disabled pipe components
    for pipe_name in disabled_pipes:
        nlp.enable_pipe(pipe_name)

    # Result after training
    print("Result AFTER training:")
    doc = nlp('I need a taxi to Festy.')
    print_doc_entities(doc)


def main():
    nlp = spacy.load('en_core_web_sm')
    customizing_pipeline_component(nlp)


if __name__ == '__main__':
    main()
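
The training examples above mark each entity by character offsets, for example (25, 30, 'DISTRICT') for "Festy". Offsets typed by hand are easy to get wrong by one character, so it may be worth checking them before training. The following is only a minimal sketch in plain Python with no spaCy dependency. The helper names `find_entity_offsets` and `check_offsets` are hypothetical (they are not part of spaCy), and the sketch assumes end-exclusive offsets, as in the example above.

```python
from typing import List, Tuple

# (start_char, end_char, label), with end_char exclusive (assumed convention)
Annotation = Tuple[int, int, str]


def find_entity_offsets(text: str, phrase: str, label: str) -> Annotation:
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


def check_offsets(text: str, entities: List[Annotation]) -> List[str]:
    """Return the substring covered by each annotation.

    Raises ValueError if a span is empty or falls outside the text.
    """
    covered = []
    for start, end, label in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"bad span ({start}, {end}, {label!r}) for {text!r}")
        covered.append(text[start:end])
    return covered


train_data = [
    ('We need to deliver it to Festy.', [(25, 30, 'DISTRICT')]),
    ('I like red oranges', []),
]

# Print the text each annotation actually covers, so mistakes are visible
for text, entities in train_data:
    print(check_offsets(text, entities))

# Compute offsets from the phrase instead of counting characters by hand
print(find_entity_offsets('I need a taxi to Festy.', 'Festy', 'DISTRICT'))
```

Printing the covered substrings makes an off-by-one span visible at a glance. This check only looks at characters, so it does not tell you whether a span lines up with spaCy's token boundaries.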