How many lines and documents should the training data for the OpenNLP categorizer contain?

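I am following the documentation for Apache OpenNLP. I was able to understand the sentence detector, the tokenizer, and the name finder, but I am stuck on the document categorizer. What I cannot figure out is how to create a categorization model.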

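I understand that I need to create a training file. The format is clear: each line holds a category, a space, and then the document on a single line. The file is saved with the .train extension.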

So I created the following file:

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?

I ran this command:

opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8

It starts doing something and then fails with an error. This is the output in the command prompt:

Indexing events using cutoff of 5

    Computing event counts...  done. 2 events
    Indexing...  Dropped event Refund:[bow=What, bow=is, bow=the, bow=refund, bow=status, bow=for, bow=my, bow=order, bow=#342, bow=?]
Dropped event NewOffers:[bow=Are, bow=there, bow=any, bow=new, bow=offers, bow=for, bow=your, bow=products, bow=?]
done.
Sorting and merging events... Done indexing.
Incorporating indexed data for training...  
Exception in thread "main" java.lang.NullPointerException
    at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
    at opennlp.maxent.GIS.trainModel(GIS.java:256)
    at opennlp.model.TrainUtil.train(TrainUtil.java:184)
    at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:162)
    at opennlp.tools.cmdline.doccat.DoccatTrainerTool.run(DoccatTrainerTool.java:61)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)

I just cannot figure out why there is a NullPointerException here. I also tried adding two more lines, but to no avail:

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?  

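I found this blog, but it does almost exactly the same thing. When I tried his training file, it worked like a charm. What is wrong with my file, and how do I fix the error?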

Running opennlp DoccatTrainer on its own prints the help text, so the path is not the problem. Any help is appreciated.

EDIT: I changed the file to

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ? 

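and it worked. I assumed it had something to do with the documents (apparently each one should contain two sentences), so I removed the last two lines, leaving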

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products? 

But it failed again. So the question boils down to this: what kind of data, format, and documents are required?

Thanks

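You need more than 5 samples for each category, because the default cutoff is 5. As far as I understand it, the cutoff is the minimum number of times a feature (here a bag-of-words token such as bow=refund) must occur in the training data to be kept. With so little data the tokens do not reach that count, so the events are dropped (the "Dropped event" lines in your output) and the trainer is left with nothing to train on, which appears to be what triggers the NullPointerException.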

Please refer to this blog post: http://madhawagunasekara.blogspot.com/2014/11/nlp-categorizer.html

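You can change the default with the -cutoff flag of the DoccatTrainer command. In your case, add -cutoff 1 to the command from the question, so that every feature occurring at least once is kept and none of your documents is dropped.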
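The effect of the cutoff on the files from the question can be illustrated with a small simulation. This is only a sketch, not OpenNLP's code: the function name apply_cutoff is hypothetical, and it assumes that documents are split on whitespace, that every occurrence of a token counts, and that a token is kept once its count reaches the cutoff.

```python
from collections import Counter


def apply_cutoff(lines, cutoff=5):
    """Simulate a feature cutoff on "<category> <document>" lines.

    Returns (kept_tokens, dropped_document_indices). Hypothetical helper,
    a simplified model of the behaviour and not OpenNLP's implementation.
    """
    # First whitespace-separated token is the category, the rest is the document.
    samples = []
    for line in lines:
        category, *tokens = line.split()
        samples.append((category, tokens))

    # Count every occurrence of every token across all documents.
    counts = Counter(tok for _, tokens in samples for tok in tokens)

    # Keep only tokens seen at least `cutoff` times.
    kept = {tok for tok, n in counts.items() if n >= cutoff}

    # A document with no surviving token has nothing left to train on.
    dropped = [i for i, (_, tokens) in enumerate(samples)
               if not kept.intersection(tokens)]
    return kept, dropped


# The first training file from the question (two documents).
two_lines = [
    "Refund What is the refund status for my order #342 ?",
    "NewOffers Are there any new offers for your products ?",
]

# The edited four-line file from the question that trained successfully.
four_lines = [
    "Refund What is the refund status for my order #342 ? "
    "Can I place a refund request for clothes ?",
    "NewOffers Are there any new offers for your products ? "
    "what are the offers on new products or new offers on old products?",
    "Refund Can I place a refund request for electronics ?",
    "NewOffers Is there any new offer on buying worth 5000 ? ",
]

if __name__ == "__main__":
    print(apply_cutoff(two_lines))            # default cutoff of 5
    print(apply_cutoff(four_lines))           # default cutoff of 5
    print(apply_cutoff(two_lines, cutoff=1))  # cutoff lowered to 1
```

Under these assumptions, both documents of the two-line file are dropped at cutoff 5, which matches the "Dropped event" lines in the log. In the four-line file only the "?" token reaches a count of 5, which would explain why training ran at all, although a model built on that single feature would hardly be useful. With the cutoff lowered to 1, no document is dropped.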