spaCy BILOU format to spaCy JSON format
I'm trying to upgrade my spaCy version to nightly, specifically to use spaCy transformers. So I converted a dataset in spaCy's simple training format, like
```
td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]
```
into
[[{"head": 0, "dep": "", "tag": "", "orth": "Who", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "is", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "Shaka", "ner": "B-FRIENDS", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": "Khan", "ner": "L-FRIENDS", "id": 3}, {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 4}], [{"head": 0, "dep": "", "tag": "", "orth": "I", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "like", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "London", "ner": "U-LOC", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": ".", "ner": "O", "id": 3}]]
using the following script:
```
import json

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

sentences = []
for t in td:
    doc = nlp(t[0])
    # align the (start, end, label) offsets to the tokenization as BILOU tags
    tags = offsets_to_biluo_tags(doc, t[1]['entities'])
    ner_info = list(zip(doc, tags))
    tokens = []
    for n, i in enumerate(ner_info):
        token = {"head": 0,
                 "dep": "",
                 "tag": "",
                 "orth": i[0].orth_,
                 "ner": i[1],
                 "id": n}
        tokens.append(token)
    sentences.append(tokens)

with open("train_data.json", "w") as js:
    json.dump(sentences, js)
```
Then I tried to convert this train_data.json using spaCy's convert command:
```
python -m spacy convert train_data.json converted/
```
but the result in the converted folder is
```
✔ Generated output file (0 documents): converted/train_data.spacy
```
which means it didn't create the dataset. Can anybody help with what I am missing? I am trying to do this with spacy-nightly.
You can skip the intermediate JSON step and convert the annotations directly to a DocBin:
```
import spacy
from spacy.training import Example
from spacy.tokens import DocBin

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in td:
    # build a Doc with the gold entity spans applied from the character offsets
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)
db.to_disk("td.spacy")
```
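To sanity-check the saved file, you can load it back and inspect the entities. A quick check sketch, not required for training; the expected output in the comment is just what these two examples should produce:
```
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# DocBin.from_disk returns the DocBin itself; get_docs yields the stored Docs
docs = list(DocBin().from_disk("td.spacy").get_docs(nlp.vocab))
print([(ent.text, ent.label_) for doc in docs for ent in doc.ents])
# expected: [('Shaka Khan', 'FRIENDS'), ('London', 'LOC')]
```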
See: https://nightly.spacy.io/usage/v3#migrating-training-python
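Once you have td.spacy, you can point the v3 training CLI straight at it. A minimal sketch, assuming a config.cfg generated with `spacy init config`; reusing the same file as the dev set here is only for illustration, in practice you'd hold out separate dev data:
```
python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --paths.train td.spacy --paths.dev td.spacy
```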
(If you do want to use the intermediate JSON format, here is the spec: https://spacy.io/api/annotation#json-input. You can include just `orth` and `ner` in the tokens and leave the other features out, but you need the structure with `paragraphs`, `raw`, and `sentences`. An example is here: https://github.com/explosion/spaCy/blob/45c9a688285081cd69faa0627d9bcaf1f5e799a1/examples/training/training-data.json)
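For reference, a minimal sketch of that intermediate structure for the first example sentence, with only `orth` and `ner` filled in and the nesting of `paragraphs`, `raw`, and `sentences` following the linked spec:
```
[{
    "id": 0,
    "paragraphs": [{
        "raw": "Who is Shaka Khan?",
        "sentences": [{
            "tokens": [
                {"orth": "Who", "ner": "O"},
                {"orth": "is", "ner": "O"},
                {"orth": "Shaka", "ner": "B-FRIENDS"},
                {"orth": "Khan", "ner": "L-FRIENDS"},
                {"orth": "?", "ner": "O"}
            ]
        }]
    }]
}]
```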