Custom sentence boundary detection in SpaCy

I'm trying to write a custom sentence segmenter in spaCy that returns the whole document as a single sentence.

I wrote a custom pipeline component using the code from here.

However, I can't get it to work: instead of changing the sentence boundaries so that the whole document is one sentence, it throws two different errors.

If I create a blank language instance and add only my custom component to the pipeline, I get this error:

ValueError: Sentence boundary detection requires the dependency parse, which requires a statistical model to be installed and loaded.

If I add the parser component to the pipeline:

nlp = spacy.blank('es')
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, last=True)
def custom_sbd(doc):
    print("EXECUTING SBD!!!!!!!!!!!!!!!!!!!!")
    doc[0].sent_start = True
    for i in range(1, len(doc)):
        doc[i].sent_start = False
    return doc
nlp.begin_training()
nlp.add_pipe(custom_sbd, first=True)

I get the same error.

If I change the order so that it parses first and changes the sentence boundaries afterwards, the error becomes:

Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

So if it throws one error demanding the dependency parse when there is none (or when the parser runs after my custom sentence boundary detection), and a different error when the dependency parse runs first, what is the correct way to do this?

Thanks!

Ines from spaCy answered my question here:

Thanks for bringing this up – and sorry this is a little confusing. I'm pretty sure the first problem you describe is already fixed on master. spaCy should definitely respect custom sentence boundaries, even in pipelines with no dependency parser.

If you want to use your custom SBD component without a parser, a very simple solution would be to set doc.is_parsed = True in your custom component. So when Doc.sents checks for the dependency parse, it looks at is_parsed and won't complain.
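Following that suggestion, the component from the question with `doc.is_parsed = True` added would look roughly like this (spaCy v2 API, as described in the answer; a sketch, not a definitive implementation):

```python
def custom_sbd(doc):
    """Pipeline component that treats the whole Doc as one sentence."""
    doc[0].sent_start = True           # only the first token starts a sentence
    for token in doc[1:]:
        token.sent_start = False       # every other token continues it
    # Mark the doc as parsed so Doc.sents won't demand a dependency
    # parse when there is no parser in the pipeline (spaCy v2 API).
    doc.is_parsed = True
    return doc
```

With `nlp = spacy.blank('es')` and `nlp.add_pipe(custom_sbd, first=True)`, iterating `doc.sents` should then yield a single sentence instead of raising the ValueError.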

If you want to use your component with the parser, make sure to add it before the parser. The parser should always respect already set sentence boundaries from previous processing steps.
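For readers on current spaCy (v3+), where `doc.is_parsed` no longer exists and components are registered by name, the same idea can be sketched like this (the component name `single_sentence` is my own, not from the original answer):

```python
import spacy
from spacy.language import Language

@Language.component("single_sentence")
def single_sentence(doc):
    # Mark only the first token as a sentence start, so the
    # whole Doc is treated as one sentence.
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc

nlp = spacy.blank("es")
nlp.add_pipe("single_sentence", first=True)

doc = nlp("Esta es una frase. Esta es otra frase.")
print(len(list(doc.sents)))
```

In v3, `Doc.sents` works as soon as sentence-start annotations are present, so no `is_parsed` flag is needed; and to combine this with a parser, the same "add it before the parser" ordering applies, e.g. `nlp.add_pipe("single_sentence", before="parser")` on a pipeline that has one.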