How do I convert simple training style data to spaCy's command line JSON format?

Question

I have training data for a new NER type, in the format shown in the "Training an additional entity type" section of the spaCy documentation.

TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("Do they bite?", {
        'entities': []
    }),

    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),

    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]

I want to train an NER model on this data using the spacy command line application, which requires data in spaCy's JSON format. How do I write the data above (i.e. text with labeled character offset spans) in this JSON format?

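After looking at the documentation for that format, it is not clear to me how to write data in it by hand. (For example, do I partition everything into paragraphs?) There is also a convert command line utility that converts non-spaCy data formats to spaCy's format, but it does not accept a spaCy format like the one above as input.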

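I understand the NER training code examples that use the "Simple training style", but I would like to be able to train with the command line utility. (Though, as is probably apparent, I am unclear on when that style should be used and when the command line should be used.)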

Can someone show me an example of the data above in "spaCy's JSON format", or point me to documentation that explains how to do this conversion?

Answer 1

spaCy has a built-in function that gets you most of the way there:

from spacy.gold import biluo_tags_from_offsets

It takes the "offset"-style annotations you have there and converts them to the token-by-token BILUO format (also written BILOU).
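To make the conversion concrete, here is a minimal sketch of the idea in plain Python. It is not spaCy's implementation: `simple_tokenize` and `offsets_to_biluo` are hypothetical helpers, the regex tokenizer is only a stand-in for spaCy's tokenizer, and spans that do not line up with token boundaries are simply skipped here rather than flagged.

```python
import re

def simple_tokenize(text):
    """Return (start, end, text) triples for words and punctuation marks.

    Hypothetical stand-in for spaCy's tokenizer, for illustration only.
    """
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def offsets_to_biluo(tokens, entities):
    """Assign one BILUO tag per token from (start, end, label) character spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of the tokens that lie fully inside the entity span
        inside = [i for i, (s, e, _) in enumerate(tokens)
                  if s >= start and e <= end]
        if not inside:
            continue  # no token falls inside this span; leave everything as "O"
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label      # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label      # Begin
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # In
            tags[inside[-1]] = "L-" + label     # Last
    return tags

text = "they pretend to care about your feelings, those horses"
tokens = simple_tokenize(text)
tags = offsets_to_biluo(tokens, [(48, 54, "ANIMAL")])
print(list(zip([t[2] for t in tokens], tags)))
```

Every token outside an entity gets "O", a single-token entity such as "horses" gets "U-ANIMAL", and a multi-token entity would get a "B-", optional "I-", and "L-" sequence.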

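To get the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill the other slots the data requires: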

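import spacy

# Assumption: only the tokenizer is needed here, so a blank English pipeline is enough.
nlp = spacy.blank("en")

sentences = []
for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    # One BILUO tag per token, derived from the character-offset entity spans
    tags = biluo_tags_from_offsets(doc, annotations['entities'])
    tokens = []
    for n, (token, tag) in enumerate(zip(doc, tags)):
        # head, dep and tag are placeholders, since only NER is annotated
        tokens.append({
            "head": 0,
            "dep": "",
            "tag": "",
            "orth": token.text,
            "ner": tag,
            "id": n,
        })
    sentences.append(tokens)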
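The loop above stops at a list of token lists. The sketch below shows one way to wrap those lists into the nested documents > paragraphs > sentences > tokens layout, as I understand it from the spaCy v2 JSON format documentation. `to_spacy_json` is a hypothetical helper, each text is assumed to be its own document with one paragraph and one sentence, and the optional "brackets" field is left out.

```python
import json

def to_spacy_json(texts, sentences):
    """Wrap per-sentence token dicts into the nested layout.

    Hypothetical helper: one document, one paragraph and one sentence per text.
    """
    docs = []
    for doc_id, (text, tokens) in enumerate(zip(texts, sentences)):
        docs.append({
            "id": doc_id,
            "paragraphs": [{
                "raw": text,
                "sentences": [{"tokens": tokens}],
            }],
        })
    return docs

# Hand-written token list for one short example, in the same shape
# as the token dicts built in the loop above
example_tokens = [
    {"id": 0, "orth": "horses", "tag": "", "head": 0, "dep": "", "ner": "U-ANIMAL"},
    {"id": 1, "orth": "?", "tag": "", "head": 0, "dep": "", "ner": "O"},
]

docs = to_spacy_json(["horses?"], [example_tokens])
print(json.dumps(docs, indent=2))
```

With the real data you would call `to_spacy_json([t[0] for t in TRAIN_DATA], sentences)` and write the result to a file with `json.dump`, then pass that file to the training command.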

Make sure you disable the non-NER pipelines before training with this data. I have run into some issues using spacy train on NER-only data. See #1907 and also check out this discussion on the Prodigy forum for some possible workarounds.
