ValueError: [E024] Could not find an optimal move to supervise the parser

I get the following error while training a spaCy NER model with custom training data.

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

Can anyone help me resolve this?

Pass your training data through the function below and it should work fine, without any errors.

import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data
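
For illustration, here is a minimal sketch of how this cleaning step fixes a span with trailing whitespace, which is the kind of misaligned annotation that can trigger E024. The sample text and offsets are made up; trim_entity_spans is the function defined above:

TRAIN_DATA = [
    # the PERSON span (0, 11) ends on a trailing space
    ("John Smith  works at Acme Corp", {
        'entities': [(0, 11, 'PERSON'), (21, 30, 'ORG')]
    }),
]

cleaned = trim_entity_spans(TRAIN_DATA)
print(cleaned[0][1]['entities'])
# [[0, 10, 'PERSON'], [21, 30, 'ORG']] -- the trailing space has been trimmed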

This happens when there is empty content (data) in your annotations. Examples of empty data could be the label itself, or the start and end offsets of an entity span. The trimming/cleaning solution provided above should work. However, if you want a brute-force approach, simply wrap the model update in an exception handler before updating the model, as shown below:

import random
import spacy

def train_spacy(data, iterations):
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],         # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,       # dropout, to avoid memorising the data
                        sgd=optimizer,  # update weights with this optimizer
                        losses=losses)
                except Exception as error:
                    # skip any example that raises an error (e.g. E024)
                    print(error)
                    continue
            print(losses)
    return nlp

So, suppose your TRAIN_DATA contains 1000 rows and only row 200 has empty data: instead of the model throwing an error, it will simply skip row 200 and train on the remaining data.
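
Putting it together, a minimal (hypothetical) driver could look like this; the iteration count and the output path are just examples:

# clean the data first, then train with the brute-force error handling above
TRAIN_DATA = trim_entity_spans(TRAIN_DATA)
nlp = train_spacy(TRAIN_DATA, 20)

# save the trained model to disk (the path is just an example)
nlp.to_disk('custom_ner_model')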