Annotated Training data for NER corpus

The data can be converted to the OpenNLP name finder training format. Which is one sentence per line. Some other formats are available as well. The sentence must be tokenized and contain spans which mark the entities. Documents are separated by empty lines which trigger the reset of the adaptive feature generators. A training file can contain multiple types. If the training file contains multiple types the created model will also be able to detect these multiple types. For now it is recommended to only train single type models, since multi type support is still experimental.
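To make the format concrete, here is a minimal sketch of how tokenized sentences with entity spans could be written out as one sentence per line, with an empty line between documents. It uses the `<START:type> ... <END>` span markup from the OpenNLP name finder documentation. The function names and the `(start, end, type)` span representation are my own assumptions for illustration, not part of OpenNLP, and the sketch assumes spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span representation: (start_token, end_token_exclusive, entity_type)
Span = Tuple[int, int, str]
Sentence = Tuple[List[str], List[Span]]


def to_opennlp_line(tokens: List[str], spans: List[Span]) -> str:
    """Render one tokenized sentence as a single training line."""
    starts = {start: etype for start, _, etype in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
        if i + 1 in ends:
            out.append("<END>")
    # Tokens and markers are separated by single spaces
    return " ".join(out)


def to_opennlp_corpus(documents: List[List[Sentence]]) -> str:
    """One sentence per line; an empty line separates documents."""
    blocks = ["\n".join(to_opennlp_line(t, s) for t, s in doc) for doc in documents]
    return "\n\n".join(blocks) + "\n"


line = to_opennlp_line(
    ["Pierre", "Vinken", ",", "61", "years", "old"],
    [(0, 2, "person")],
)
print(line)
```

The empty line between documents matters because, as the quoted text says, it is what resets the adaptive feature generators.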

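So if you use one of the tools mentioned in the other answers, you need to make sure OpenNLP can read that format, or convert the output to a format it can read.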
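As an illustration of such a conversion, here is a rough sketch that assumes your tool exports CoNLL-style BIO tags (`B-person`, `I-person`, `O`), which is an assumption about the tool, not something guaranteed by any of them. The function name `bio_to_opennlp` is hypothetical. It turns one tagged sentence into a line with `<START:type> ... <END>` markers.

```python
from typing import List


def bio_to_opennlp(tokens: List[str], tags: List[str]) -> str:
    """Convert one BIO-tagged sentence into a single name finder training line."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    out = []
    current = None  # entity type of the span that is currently open, if any
    for tok, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # An I- tag only continues a span of the same type
        inside = prefix == "I" and etype == current
        if current is not None and not inside:
            out.append("<END>")
            current = None
        if prefix in ("B", "I") and not inside:
            # A stray I- tag without a preceding B- is treated as a span start
            out.append(f"<START:{etype}>")
            current = etype
        out.append(tok)
    if current is not None:
        out.append("<END>")
    return " ".join(out)


converted = bio_to_opennlp(
    ["Mr", ".", "Vinken", "is", "chairman", "of", "Elsevier"],
    ["O", "O", "B-person", "O", "O", "O", "B-organization"],
)
print(converted)
```

Other export formats (standoff annotations, JSON, and so on) would need their own small converter along the same lines.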

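I am sorry to say there really is no good shortcut here. We had to do this many times on past projects. Sometimes we were lucky enough to have labelers build a manually annotated dataset for us; the rest of the time we did it ourselves.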

Also, I am not sure you really need 15k data items. I would suggest starting with as few as 1-2k and testing the performance; depending on your particular case, you might be surprised by the results.
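One way to check this is to train on growing subsets of what you have already annotated and compare the evaluation scores. Below is a minimal sketch of building such subsets; the helper name `nested_subsets` and the example sizes are hypothetical, and the training and evaluation steps themselves are left to whatever tooling you use. Note that it samples individual sentences, so it ignores document boundaries.

```python
import random
from typing import Dict, Iterable, List


def nested_subsets(sentences: List[str], sizes: Iterable[int], seed: int = 0) -> Dict[int, List[str]]:
    """Return subsets of increasing size, where each smaller one is a prefix of the larger."""
    shuffled = list(sentences)
    # Fixed seed so the subsets are reproducible between runs
    random.Random(seed).shuffle(shuffled)
    # Sizes larger than the available data are skipped
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


annotated = [f"sentence {i}" for i in range(3000)]  # placeholder data
subsets = nested_subsets(annotated, sizes=[1000, 2000, 15000])
for n, subset in subsets.items():
    print(n, len(subset))
```

If the score barely moves between the smaller and larger subsets, annotating much more data is unlikely to pay off.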

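As for building your dataset: we initially used plain old Excel sheets, but that quickly turned into a nightmare. Excel is not designed for this kind of task, and going through 1,000 rows of text to annotate them by hand in Excel is very painful.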

Here are some tools I would recommend:

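Dataturks: https://dataturks.com: A very easy-to-use online tool with an intuitive UI; you can have a team work on a dataset at the same time. The output is fully compatible with openNLP, coreNLP, etc.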

GATE: http://gate.ac.uk/: A good old tool. You download it to your local machine and it works well, though it is a bit of a pain to set up.

BRAT: http://brat.nlplab.org/: An open-source, downloadable tool that does a good job at tagging.

Hope this helps, and happy tagging :)

nlp

corpus

named-entity-recognition

training-data

opennlp