spaCy CLI debug shows 0 train/dev docs in CLI-formatted JSON converted by spacy.gold.docs_to_json
Question
I am trying to run the spaCy CLI, but my training and dev data appear to be incorrect, as I see when I run debug-data:
| => python3 -m spacy debug-data en ./CLI_train_randsplit_anno191022.json ./CLI_dev_randsplit_anno191022.json --pipeline ner --verbose
=========================== Data format validation ===========================
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: ner
Starting with blank model 'en'
0 training docs
0 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train from a blank model (0)
It's recommended to use at least 2000 examples (minimum 100)
============================== Vocab & Vectors ==============================
ℹ 0 total words in the data (0 unique)
10 most common words:
ℹ No word vectors present in the model
========================== Named Entity Recognition ==========================
ℹ 0 new labels, 0 existing labels
0 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
================================== Summary ==================================
✔ 5 checks passed
✘ 1 error
Trying to train anyway produces:
| => python3 -m spacy train en ./models/CLI_1 ./CLI_train_randsplit_anno191022.json ./CLI_dev_randsplit_anno191022.json -n 150 -p 'ner' --verbose
dropout_from = 0.2 by default
dropout_to = 0.2 by default
dropout_decay = 0.0 by default
batch_from = 100.0 by default
batch_to = 1000.0 by default
batch_compound = 1.001 by default
Training pipeline: ['ner']
Starting with blank model 'en'
beam_width = 1 by default
beam_density = 0.0 by default
beam_update_prob = 1.0 by default
Counting training words (limit=0)
learn_rate = 0.001 by default
optimizer_B1 = 0.9 by default
optimizer_B2 = 0.999 by default
optimizer_eps = 1e-08 by default
L2_penalty = 1e-06 by default
grad_norm_clip = 1.0 by default
parser_hidden_depth = 1 by default
subword_features = True by default
conv_depth = 4 by default
bilstm_depth = 0 by default
parser_maxout_pieces = 2 by default
token_vector_width = 96 by default
hidden_width = 64 by default
embed_size = 2000 by default
Itn NER Loss NER P NER R NER F Token % CPU WPS
--- --------- ------ ------ ------ ------- -------
✔ Saved model to output directory
models/CLI_1/model-final
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/spacy/cli/train.py", line 389, in train
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 673, in evaluate
docs, golds = zip(*docs_golds)
ValueError: not enough values to unpack (expected 2, got 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
plac.call(commands[command], sys.argv[1:])
File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/usr/local/lib/python3.7/site-packages/spacy/cli/train.py", line 486, in train
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
File "/usr/local/lib/python3.7/site-packages/spacy/cli/train.py", line 548, in _collate_best_model
bests[component] = _find_best(output_path, component)
File "/usr/local/lib/python3.7/site-packages/spacy/cli/train.py", line 567, in _find_best
accs = srsly.read_json(epoch_model / "accuracy.json")
File "/usr/local/lib/python3.7/site-packages/srsly/_json_api.py", line 50, in read_json
file_path = force_path(location)
File "/usr/local/lib/python3.7/site-packages/srsly/util.py", line 21, in force_path
raise ValueError("Can't read file: {}".format(location))
ValueError: Can't read file: models/CLI_1/model0/accuracy.json
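My training and dev docs were generated using spacy.gold.docs_to_json() and saved as JSON files using the following function:
import json
from spacy.gold import docs_to_json

def make_CLI_json(mock_docs, CLI_out_file_path):
    CLI_json = docs_to_json(mock_docs)
    with open(CLI_out_file_path, 'w') as json_file:
        json.dump(CLI_json, json_file)
I verified that both files are valid JSON.
I created the docs for these JSON files using the following function: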
from spacy import displacy
from spacy.lang.en import English

def import_from_doccano(jx_in_file_path, view=True):
    annotations = load_jsonl(jx_in_file_path)
    mock_nlp = English()
    sentencizer = mock_nlp.create_pipe("sentencizer")
    unlabeled = 0
    DATA = []
    mock_docs = []
    for anno in annotations:
        # get DATA (as used in spacy inline training)
        # the doccano key is "labels" (the check originally read "label")
        if "labels" in anno.keys():
            ents = [tuple([label[0], label[1], label[2]])
                    for label in anno["labels"]]
        else:
            ents = []
        DATUM = (anno["text"], {"entities": ents})
        DATA.append(DATUM)
        # mock a doc for viz in displacy
        mock_doc = mock_nlp(anno["text"])
        if "labels" in anno.keys():
            entities = anno["labels"]
            if not entities:
                unlabeled += 1
            ents = [(e[0], e[1], e[2]) for e in entities]
            spans = [mock_doc.char_span(s, e, label=L) for s, e, L in ents]
            mock_doc.ents = _cleanup_spans(spans)
        # run the sentencizer only after the entities have been set
        sentencizer(mock_doc)
        if view:
            displacy.render(mock_doc, style='ent')
        mock_docs.append(mock_doc)
    print(f'Unlabeled: {unlabeled}')
    return DATA, mock_docs
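I wrote the function above to return the examples in the format required for inline training (e.g. as shown at https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py) as well as to form these kind of "mock" docs so that I can use displacy and/or the CLI. For the latter purpose, I followed the code shown at https://github.com/explosion/spaCy/blob/master/spacy/cli/converters/jsonl2json.py with a couple of notable differences. The _cleanup_spans() function is identical to the one in the example. I did not use minibatch() but made a separate doc for each of my labeled annotations. Also (perhaps critically?), I found that running the sentencizer first ruined many of my annotations, possibly because the spans get shifted in a way that the _cleanup_spans() function fails to repair properly. Removing the sentencizer causes the docs_to_json() function to throw an error. In my function (unlike in the linked example) I therefore run the sentencizer on each doc after the entities have been written to it, which preserves my annotations properly and allows the docs_to_json() function to run without complaints.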
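The function load_jsonl, called within import_from_doccano(), is defined as:
import json

def load_jsonl(input_path):
    data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            # str.replace() takes a literal string, not a regex, so the
            # original replace('\n|\r', '') was a no-op; strip() removes
            # the trailing line ending instead
            data.append(json.loads(line.strip(), strict=False))
    print('Loaded {} records from {}'.format(len(data), input_path))
    print()
    return data
Each of my annotations is ~10000 characters long or less. They are exported from doccano (https://doccano.herokuapp.com/) as JSONL using the format: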
{"id": 1, "text": "EU rejects ...", "labels": [[0,2,"ORG"], [11,17, "MISC"], [34,41,"ORG"]]}
{"id": 2, "text": "Peter Blackburn", "labels": [[0, 15, "PERSON"]]}
{"id": 3, "text": "President Obama", "labels": [[10, 15, "PERSON"]]}
...
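For reference, here is a minimal stdlib-only sketch of how one record in this format maps to the (text, {"entities": [...]}) tuples used for inline training. The helper name doccano_line_to_example is hypothetical and not part of spaCy or doccano; the only assumption about the data is the "text" and "labels" keys shown above.

```python
import json


def doccano_line_to_example(line):
    """Convert one doccano JSONL line to a (text, {"entities": [...]}) tuple."""
    record = json.loads(line)
    # "labels" may be absent for records that were never annotated.
    entities = [(start, end, label)
                for start, end, label in record.get("labels", [])]
    return record["text"], {"entities": entities}


sample = '{"id": 3, "text": "President Obama", "labels": [[10, 15, "PERSON"]]}'
text, annotations = doccano_line_to_example(sample)

# Slice the text with each character offset to check the span by eye.
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))  # prints: PERSON 'Obama'
```

Printing the sliced text next to its label is a quick way to spot offsets that do not line up with the intended entity.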
The data are split into training and test sets using the following function:
import random

def test_train_split(DATA, mock_docs, n_train):
    # shuffle the examples and their docs together so they stay aligned
    L = list(zip(DATA, mock_docs))
    random.shuffle(L)
    DATA, mock_docs = zip(*L)
    DATA = list(DATA)
    mock_docs = list(mock_docs)
    TRAIN_DATA = DATA[:n_train]
    train_docs = mock_docs[:n_train]
    TEST_DATA = DATA[n_train:]
    test_docs = mock_docs[n_train:]
    return TRAIN_DATA, TEST_DATA, train_docs, test_docs
Finally, each set is written to JSON using the following function:
def make_CLI_json(mock_docs, CLI_out_file_path):
    CLI_json = docs_to_json(mock_docs)
    with open(CLI_out_file_path, 'w') as json_file:
        json.dump(CLI_json, json_file)
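I don't understand why debug-data shows 0 training docs and 0 dev docs, or why the train command fails. As far as I can tell, the JSON looks correct. Is my data formatted incorrectly, or is something else going on? Any help or insights would be greatly appreciated.
This is my first question on SE, so apologies in advance if I've failed to follow some guideline or other. There are a lot of components involved, so I'm not sure how I might produce a minimal code example that would replicate my problem.
Environment
Mac OS 10.15 Catalina
Everything is pip3 installed into the user path
No virtual environment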
| => python3 -m spacy info --markdown
## Info about spaCy
* **spaCy version:** 2.2.1
* **Platform:** Darwin-19.0.0-x86_64-i386-64bit
* **Python version:** 3.7.4
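Answer
This is a legitimately confusing aspect of the API. For internal/historical reasons, spacy.gold.docs_to_json() produces a dict that still needs to be wrapped in a list to get to the final training format. Try:
srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
spacy debug-data doesn't have proper schema checks yet, so this is more frustrating/confusing than it should be.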
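The wrapping step can be sketched with the standard library alone, which also gives a cheap check to run on a file before handing it to the CLI. Everything here is illustrative: write_cli_json and looks_like_cli_json are hypothetical helpers, json stands in for srsly, and doc_dict is a hand-written stand-in for the dict returned by docs_to_json() (its keys reflect my understanding of the spaCy v2 JSON format and should be treated as an assumption). The check is not spaCy's own validation.

```python
import json
import os
import tempfile


def write_cli_json(doc_dict, path):
    """Write a docs_to_json()-style dict wrapped in a list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([doc_dict], f)


def looks_like_cli_json(path):
    """Rough shape check: top level is a list of dicts with "paragraphs"."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return isinstance(data, list) and all(
        isinstance(doc, dict) and "paragraphs" in doc for doc in data
    )


# Hand-written stand-in for the output of docs_to_json(docs) (assumed shape).
doc_dict = {
    "id": 0,
    "paragraphs": [{
        "raw": "President Obama",
        "sentences": [{"tokens": [
            {"id": 0, "orth": "President", "ner": "O"},
            {"id": 1, "orth": "Obama", "ner": "U-PERSON"},
        ]}],
    }],
}

path = os.path.join(tempfile.mkdtemp(), "train.json")
write_cli_json(doc_dict, path)
print(looks_like_cli_json(path))  # True for the wrapped file
```

A file written with json.dump(doc_dict, f), without the surrounding list, fails this check; that is the shape the make_CLI_json() function in the question produces.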