ValueError: [E024] Could not find an optimal move to supervise the parser
I am getting the following error while training a spaCy NER model on custom training data.
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
Can anyone help me fix this?
Passing the training data through the function below works fine, without any errors.
import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
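To see what the trimming step actually does, here is a minimal, self-contained sketch of the same span-trimming logic applied to a made-up sentence (the text, offsets, and helper name are illustrative only, not part of spaCy):

```python
import re

invalid_span_tokens = re.compile(r'\s')

def trim_span(text, start, end):
    # shift start to the right past any leading whitespace
    while start < len(text) and invalid_span_tokens.match(text[start]):
        start += 1
    # shift end to the left past any trailing whitespace
    while end > 1 and invalid_span_tokens.match(text[end - 1]):
        end -= 1
    return start, end

text = "John Smith works at Acme"
# the annotation accidentally includes the trailing space after "Smith"
print(trim_span(text, 0, 11))  # -> (0, 10), now exactly covering "John Smith"
```

Whitespace at a span boundary means the span no longer aligns with token boundaries, which is one common trigger for E024; trimming restores the alignment.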
This happens when your annotations contain empty content (data). Examples of empty data include a missing label, or missing start and end offsets for a label. The solution provided above should work for trimming/cleansing the data. However, if you want a brute-force approach, simply wrap the model update in an exception handler before updating the model, as shown below:
import random
import spacy

def train_spacy(data, iterations):
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline;
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses)
                except Exception as error:
                    print(error)
                    continue
            print(losses)
    return nlp
So, assuming your TRAIN_DATA contains 1000 rows and only row 200 has empty data, instead of the model throwing an error, it will simply skip row 200 and train on the remaining data.
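As an alternative to catching the exception at update time, you could also filter out empty annotations up front, so the training loop only ever sees valid rows. A minimal sketch of that idea (the helper name and sample data are made up for illustration, not part of spaCy):

```python
def drop_empty_annotations(data):
    """Keep only entities that have a label and a non-empty span,
    and drop rows that end up with no valid entities at all."""
    cleaned = []
    for text, annotations in data:
        entities = annotations.get('entities') or []
        valid = [(start, end, label)
                 for start, end, label in entities
                 if label and start is not None and end is not None and start < end]
        if valid:
            cleaned.append((text, {'entities': valid}))
    return cleaned

TRAIN_DATA = [
    ("Acme hired Jane", {'entities': [(11, 15, 'PERSON')]}),
    ("Bad row", {'entities': [(0, 0, '')]}),  # empty label, zero-width span
]
print(drop_empty_annotations(TRAIN_DATA))
```

Pre-filtering like this makes the bad rows visible before training starts, whereas the try/except approach silently skips them on every iteration.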