spaCy: How to initialize a Doc with entities in IOB format?

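In my spaCy project, I want to initialize a Doc object from token texts, labels, and whitespace flags. However, spaCy does not appreciate the way I supply the labels, and it expresses its lack of appreciation with the following error message: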

doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
  File "spacy\tokens\doc.pyx", line 297, in spacy.tokens.doc.Doc.__init__
ValueError: [E177] Ill-formed IOB input detected: ('', 'O')

Code:

import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")

token_texts = ["I", "like", "potatoes", "!"]
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
whitespaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)

Does anyone know exactly how to serve spaCy the entities on a silver platter?

From the spaCy Doc documentation:

ents: A list of strings, of the same length of words, to assign the token-based IOB tag. Defaults to None. Optional[List[str]]

The type hint List[str] led me to try ["", "", "food", ""], but that produces the same error message.

Related Stack Overflow questions without answers:

Convert NER SpaCy format to IOB format

Failed to convert iob to spaCy binary format

The IOB tags should be in the same format as used in CoNLL files, e.g. "B-PERSON". So in your example code:

labels = ["O", "O", "I-FOOD", "O"]
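If the labels already exist as (label, IOB) tuples like the ones in the question, they can be converted to CoNLL-style strings before being passed to Doc. The following is a minimal sketch: `to_iob_tags` is a hypothetical helper, not part of spaCy, and upper-casing the label is only a convention assumed here.

```python
from typing import List, Sequence, Tuple


def to_iob_tags(pairs: Sequence[Tuple[str, str]]) -> List[str]:
    """Convert (label, iob) pairs into CoNLL-style tag strings.

    Hypothetical helper for illustration; it is not a spaCy function.
    """
    tags = []
    for label, iob in pairs:
        if iob == "O":
            # Tokens outside any entity carry the bare tag "O".
            tags.append("O")
        elif iob in ("B", "I") and label:
            # Entity tokens become "<IOB>-<LABEL>", e.g. "B-PERSON".
            # Upper-casing the label is an assumption, not a requirement.
            tags.append(f"{iob}-{label.upper()}")
        else:
            raise ValueError(f"Cannot convert pair: {(label, iob)!r}")
    return tags


# The tuples from the question.
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
iob_tags = to_iob_tags(labels)
print(iob_tags)  # ['O', 'O', 'I-FOOD', 'O']
```

The resulting list can then be passed as `ents=iob_tags` in the `Doc(nlp.vocab, words=token_texts, ents=..., spaces=whitespaces)` call from the question.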