Creating training data for a Maxent classifier in Java

I am trying to create a Java implementation of a maxent classifier. I need to classify sentences into n different classes.

I have looked at the ColumnDataClassifier in the Stanford maxent classifier, but I cannot understand how to create the training data. I need training data in a form that includes the POS tags of the words in the sentence, so that the features used by the classifier can be things like the previous word, the next word, etc.

I am looking for training data that contains POS-tagged sentences along with the class of each sentence. Example:

My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS

Any help would be appreciated.

If I understand you correctly, you want to treat a sentence as a set of POS tags.

In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP). That means each sentence is effectively a binary vector of length 37 (36 possible POS tags according to this page, plus the CLASS outcome feature).

This can be encoded for OpenNLP Maxent as follows:

PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1

or simply:

PRP$ NN VBZ NNP CLASS=SomeClassOfYours1

(For a working code snippet, see my answer here: Training models using openNLP maxent.)
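
For illustration, here is a small helper that produces training lines in this set-of-tags format from a tagged sentence. The class and method names are my own, not part of OpenNLP:

package my.maxent;

import java.util.LinkedHashSet;
import java.util.Set;

public class TagSetEncoder {

    // Collapse the tag sequence into a set (keeping first-occurrence order)
    // and append the outcome, yielding e.g. "PRP$ NN VBZ NNP CLASS=...".
    static String encode(String[] posTags, String outcome) {
        Set<String> tagSet = new LinkedHashSet<>();
        for (String tag : posTags) {
            tagSet.add(tag);
        }
        StringBuilder line = new StringBuilder();
        for (String tag : tagSet) {
            line.append(tag).append(' ');
        }
        return line.append("CLASS=").append(outcome).toString();
    }

    public static void main(String[] args) {
        String[] tags = {"PRP$", "NN", "VBZ", "NNP"};
        // Prints: PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
        System.out.println(encode(tags, "SomeClassOfYours1"));
    }
}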

More sample data would be:

  1. "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
  2. "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
  3. "As soon as she moved out, the mobile home was demolished, the suit said."
  4. ...

This would yield the samples:

IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...

However, I would not expect such a classification to yield good results. It would be better to exploit other structural features of the sentence, e.g. the parse tree or dependency tree that can be obtained using the Stanford parser.
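
As a starting point, here is a minimal sketch of obtaining a constituency parse with the Stanford parser, assuming stanford-parser.jar and its englishPCFG model are on the classpath. The two features printed are merely illustrations, not a recommended feature set:

package my.maxent;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class ParseFeatures {

    public static void main(String[] args) {
        // Load the English PCFG model shipped with the Stanford parser.
        LexicalizedParser parser = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        // parse() tokenizes and parses the raw sentence.
        Tree tree = parser.parse("My name is XYZ");

        // One possible structural feature: the depth of the parse tree.
        System.out.println("depth=" + tree.depth());

        // Another: the phrase labels directly under the root,
        // e.g. "NP VP ." for a simple declarative sentence.
        StringBuilder topLevel = new StringBuilder();
        for (Tree child : tree.firstChild().children()) {
            topLevel.append(child.label().value()).append(' ');
        }
        System.out.println("top=" + topLevel.toString().trim());
    }
}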

Edit (28 March 2016): You can also use entire sentences as training samples. However, note that:

- two sentences may contain the same words and yet have different meanings
- there is a high chance of overfitting
- you should use shorter sentences
- you need a huge training set

Following your example, I would encode the training samples as follows:

class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...

Note that the outcome variable comes as the first element on each line.

A fully working minimal example using opennlp-maxent-3.0.3.jar:

package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {


    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer( new FileEventStream(trainingFileName)); 
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);

        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);
        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }

}

And some dummy training data (stored as training-file.txt):

class=Male      My_PRP name_NN is_VBZ John_NNP
class=Male      My_PRP name_NN is_VBZ Peter_NNP
class=Female    My_PRP name_NN is_VBZ Anna_NNP
class=Female    My_PRP name_NN is_VBZ Gaby_NNP

This produces the following output (the model predicts class=Female because the predicate Gaby_NNP occurs only in a Female training sample):

Indexing events using cutoff of 0
Computing event counts...  done. 4 events
Indexing...  done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 4
        Number of Outcomes: 2
      Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-2.772588722239781  0.5
  2:  ... loglikelihood=-2.4410105407571203 1.0
      ...
 99:  ... loglikelihood=-0.16111520541752372    1.0
100:  ... loglikelihood=-0.15953272940719138    1.0
=======================================
class=Female
=======================================
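
If you need the per-class probabilities rather than just the best label, the MaxentModel interface also exposes the outcome names; a short addition to the prediction step above, reusing the loadedMaxentModel and outcomeProbs variables from the example:

        // eval() returns one probability per outcome; getOutcome(i) maps
        // each index back to its label, e.g. "class=Female".
        for (int i = 0; i < loadedMaxentModel.getNumOutcomes(); i++) {
            System.out.println(loadedMaxentModel.getOutcome(i) + " -> " + outcomeProbs[i]);
        }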