Custom NER model - FAIL

Question

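I'm new to NLP and am getting started with OpenNLP 1.5.

I went through some of the commands given in the documentation here: https://opennlp.apache.org/documentation/manual/opennlp.html
(I'm starting with the command-line interface.)

I tried out the different tools using the sample models that are already available, and eventually decided to create a custom NER model.

I followed the instructions given at the link above.

I copied the example sentences given there into a .train file (I just created a new file with that extension and pasted the content into it):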

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
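
As a sanity check on the annotation format itself, the two lines above can be read with a few lines of Python. This is only a sketch and not OpenNLP's own parser: it assumes whitespace-separated tokens with `<START:type>` and `<END>` markers, exactly as they appear in the sample, and `parse_annotated` is a hypothetical helper name.

```python
import re

# Matches an opening marker such as <START:person> and captures the type.
TAG_OPEN = re.compile(r"^<START:([^>]+)>$")
TAG_CLOSE = "<END>"


def parse_annotated(line):
    """Return (tokens, spans); each span is (start, end, type), end exclusive."""
    tokens, spans = [], []
    open_type, open_start = None, None
    for piece in line.split():
        match = TAG_OPEN.match(piece)
        if match:
            if open_type is not None:
                raise ValueError("nested <START:...> marker")
            open_type, open_start = match.group(1), len(tokens)
        elif piece == TAG_CLOSE:
            if open_type is None:
                raise ValueError("<END> without a matching <START:...>")
            spans.append((open_start, len(tokens), open_type))
            open_type = None
        else:
            tokens.append(piece)
    if open_type is not None:
        raise ValueError("unclosed <START:...> marker")
    return tokens, spans


sample = ("<START:person> Pierre Vinken <END> , 61 years old , will join "
          "the board as a nonexecutive director Nov. 29 .")
tokens, spans = parse_annotated(sample)
print(spans)                              # [(0, 2, 'person')]
print(tokens[spans[0][0]:spans[0][1]])    # ['Pierre', 'Vinken']
```

If both sample lines parse without an error, the markup is at least well-formed, so a model that recognizes nothing is unlikely to be a file-format problem.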

I built the model with the following command:

bin/opennlp TokenNameFinderTrainer -model en-ner-person2.bin -lang en -data en-ner-person2.train -encoding UTF-8

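The problem is that although the model gets created, it doesn't seem to work. I tested the newly created model with: bin/opennlp TokenNameFinder en-ner-person2.bin

But when I type Pierre Vinken, it is not recognized as a person. I also tried creating the model from a .txt file with exactly the same content, but that failed too.

What am I doing wrong?

TIA.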

Answer 1

In short: you can't expect a statistical model to learn from just two sentences. Add 14,998 more and you should be fine.

The training data should contain at least 15000 sentences to create a model which performs well.

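CRFs (Conditional Random Fields) are statistical models of this kind, and they really do need a lot of data to figure out the rules of the game. They don't just "remember" what they saw during training, so even if you ask about something taken straight from the training set, they may still fail to give the answer.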
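
To see how far a training file is from that 15000-sentence guideline, a rough count is enough. The sketch below is an illustration only: it assumes one sentence per non-empty line, as in the question's sample, and `training_stats` and `RECOMMENDED_MIN_SENTENCES` are hypothetical names, not part of OpenNLP.

```python
# Guideline figure taken from the sentence quoted above.
RECOMMENDED_MIN_SENTENCES = 15000


def training_stats(lines):
    """Count non-empty lines (assumed to be sentences) and <START:...> markers."""
    sentences = [line for line in lines if line.strip()]
    entities = sum(line.count("<START:") for line in sentences)
    return len(sentences), entities


# The two training sentences from the question, held in memory for the demo.
# For a real file: lines = open("en-ner-person2.train", encoding="utf-8")
demo_lines = [
    "<START:person> Pierre Vinken <END> , 61 years old , will join the board "
    "as a nonexecutive director Nov. 29 .",
    "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , "
    "the Dutch publishing group .",
]

n_sentences, n_entities = training_stats(demo_lines)
shortfall = max(0, RECOMMENDED_MIN_SENTENCES - n_sentences)
print(f"{n_sentences} sentences, {n_entities} entity spans, "
      f"{shortfall} sentences short of the guideline")
```

For the file in the question this reports 2 sentences and a shortfall of 14998, which is the gap the answer is pointing at.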

nlp

named-entity-recognition

opennlp