ValueError: [E024] Could not find an optimal move to supervise the parser

I get the following error while training a spaCy NER model with custom training data.

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

Can anyone help me resolve this?

Pass your training data through the function below and it should work fine, without any errors.

import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data
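
For illustration, here is a minimal sketch of how this cleaning step fixes a span with trailing whitespace, which is the kind of misaligned annotation that can trigger E024. The sample text and offsets are made up; trim_entity_spans is the function defined above:

TRAIN_DATA = [
    # the PERSON span (0, 11) ends on a trailing space
    ("John Smith  works at Acme Corp", {
        'entities': [(0, 11, 'PERSON'), (21, 30, 'ORG')]
    }),
]

cleaned = trim_entity_spans(TRAIN_DATA)
print(cleaned[0][1]['entities'])
# [[0, 10, 'PERSON'], [21, 30, 'ORG']] -- the trailing space has been trimmed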

This happens when there is empty content (data) in your annotations. Examples of empty data could be the label itself, or the start and end offsets of an entity span. The trimming/cleaning solution provided above should work. However, if you want a brute-force approach, simply wrap the model update in an exception handler before updating the model, as shown below:

import random
import spacy

def train_spacy(data, iterations):
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],         # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,       # dropout, to avoid memorising the data
                        sgd=optimizer,  # update weights with this optimizer
                        losses=losses)
                except Exception as error:
                    # skip any example that raises an error (e.g. E024)
                    print(error)
                    continue
            print(losses)
    return nlp

So, suppose your TRAIN_DATA contains 1000 rows and only row 200 has empty data: instead of the model throwing an error, it will simply skip row 200 and train on the remaining data.
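
Putting it together, a minimal (hypothetical) driver could look like this; the iteration count and the output path are just examples:

# clean the data first, then train with the brute-force error handling above
TRAIN_DATA = trim_entity_spans(TRAIN_DATA)
nlp = train_spacy(TRAIN_DATA, 20)

# save the trained model to disk (the path is just an example)
nlp.to_disk('custom_ner_model')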