Creating training data for a Maxent classifier in Java
I am trying to create a Java implementation of a maxent classifier. I need to classify sentences into n different classes.
I have had a look at the ColumnDataClassifier in the Stanford maxent classifier, but I cannot understand how to create the training data. I need training data in a form that includes the POS tags of the words in a sentence, so that the features used by the classifier can be things like the previous word, the next word, etc.
I am looking for training data that contains POS-tagged sentences along with the class of each sentence. Example:
My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS
Any help would be appreciated.
If I understand you correctly, you are trying to treat a sentence as a set of POS tags.
In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP).
That means each sentence is effectively a binary vector of length 37 (36 possible POS tags according to this page, plus a CLASS outcome feature for the whole sentence).
This can be encoded for OpenNLP Maxent as follows:
PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1
or simply:
PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
(For a working code snippet, see my answer here: Training models using openNLP maxent.)
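Such bag-of-POS-tags lines can be produced mechanically. A minimal sketch (the helper name `encodeBagOfTags` is my own, not part of OpenNLP); repeated tags are collapsed, since the representation is a binary indicator vector:

```java
import java.util.LinkedHashSet;
import java.util.List;

public class MaxentEncoding {

    // Builds one OpenNLP Maxent training line from a sentence's POS tags.
    // Duplicate tags are deduplicated: the encoding is a binary vector.
    static String encodeBagOfTags(List<String> posTags, String outcome) {
        StringBuilder line = new StringBuilder();
        for (String tag : new LinkedHashSet<>(posTags)) {
            line.append(tag).append(' ');
        }
        return line.append("CLASS=").append(outcome).toString();
    }

    public static void main(String[] args) {
        // "My name is XYZ" tagged as PRP$ NN VBZ NNP
        System.out.println(encodeBagOfTags(
                List.of("PRP$", "NN", "VBZ", "NNP"), "SomeClassOfYours1"));
        // -> PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
    }
}
```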
Further sample data would be:
- "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
- "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
- "As soon as she moved out, the mobile home was demolished, the suit said."
- ...
This would yield the samples:
IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...
However, I would not expect such a classification to produce good results. It would be better to use other structural features of the sentence, such as a parse tree or a dependency tree, which can be obtained using, e.g., the Stanford parser.
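If you do want the previous-word/next-word style features mentioned in the question, each token can emit its own contextual predicates. A sketch under my own conventions (the feature prefixes `w=`, `t=`, `prev=`, `next=` are arbitrary, not an OpenNLP requirement; Maxent predicates are just opaque strings):

```java
import java.util.ArrayList;
import java.util.List;

public class ContextFeatures {

    // Emits one predicate string per feature for token i of the sentence.
    // <S> and </S> mark the sentence boundaries.
    static List<String> featuresAt(String[] words, String[] tags, int i) {
        List<String> feats = new ArrayList<>();
        feats.add("w=" + words[i]);                                 // current word
        feats.add("t=" + tags[i]);                                  // current POS tag
        feats.add("prev=" + (i > 0 ? words[i - 1] : "<S>"));        // previous word
        feats.add("next=" + (i < words.length - 1 ? words[i + 1] : "</S>")); // next word
        return feats;
    }

    public static void main(String[] args) {
        String[] words = {"My", "name", "is", "XYZ"};
        String[] tags  = {"PRP$", "NN", "VBZ", "NNP"};
        System.out.println(featuresAt(words, tags, 1));
        // -> [w=name, t=NN, prev=My, next=is]
    }
}
```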
Edit of March 28, 2016:
You can also use entire sentences as training samples. However, note that:
- two sentences may contain the same words yet carry different meanings
- there is a high risk of overfitting
- you should use short sentences
- you will need a huge training set
Following your example, I would encode the training samples as follows:
class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...
Note that the outcome variable comes as the first element of each line.
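Generating such lines from word/tag pairs is straightforward. A small sketch (the class and method names are hypothetical, chosen only for illustration):

```java
public class TrainingLineEncoder {

    // Builds one training line in the "outcome first" format,
    // e.g. "class=Male My_PRP name_NN is_VBZ John_NNP".
    static String encode(String outcome, String[] words, String[] tags) {
        StringBuilder line = new StringBuilder("class=").append(outcome);
        for (int i = 0; i < words.length; i++) {
            line.append(' ').append(words[i]).append('_').append(tags[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Male",
                new String[]{"My", "name", "is", "John"},
                new String[]{"PRP", "NN", "VBZ", "NNP"}));
        // -> class=Male My_PRP name_NN is_VBZ John_NNP
    }
}
```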
Here is a fully working minimal example using opennlp-maxent-3.0.3.jar:
package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {

    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer(new FileEventStream(trainingFileName));
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer =
                new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);
        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);

        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }
}
And some dummy training data (stored as training-file.txt):
class=Male My_PRP name_NN is_VBZ John_NNP
class=Male My_PRP name_NN is_VBZ Peter_NNP
class=Female My_PRP name_NN is_VBZ Anna_NNP
class=Female My_PRP name_NN is_VBZ Gaby_NNP
This produces the following output:
Indexing events using cutoff of 0
Computing event counts... done. 4 events
Indexing... done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 4
Number of Outcomes: 2
Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-2.772588722239781 0.5
2: ... loglikelihood=-2.4410105407571203 1.0
...
99: ... loglikelihood=-0.16111520541752372 1.0
100: ... loglikelihood=-0.15953272940719138 1.0
=======================================
class=Female
=======================================