Stanford NLP Text Classifier, Custom Features and Confusion Matrix

Question

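I am using the Stanford NLP text classifier (ColumnDataClassifier) from my Java code. I have two main questions.

1-) How can I print more detailed evaluation information, such as a confusion matrix?

2-) My code already preprocesses the terms and extracts numeric features (vectors), such as binary features or TF-IDF values. How can I use these features to train and test the classifier?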

Answer 1

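I asked a related question. ColumnDataClassifier does not have an option to output the metrics in a confusion matrix. However, if you look at the code in ColumnDataClassifier.java, you can see where the TP, FP, TN and FN counts are printed to the output stream. That spot has the raw values you need. They could be passed to a method that aggregates them into a confusion matrix and prints it after the run, but you would have to write that code yourself.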
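As a rough illustration of the "write it yourself" route, here is a minimal sketch that collects (gold, predicted) label pairs into a confusion matrix using only the Java standard library. The class name `ConfusionMatrix` and its methods are hypothetical and not part of Stanford NLP. How you obtain the pairs is left open: one option, assuming you follow the pattern in the Stanford classifier demo, is to build a datum per test line with `makeDatumFromLine`, take the gold label from the datum and the prediction from the classifier's `classOf`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: accumulates (gold, predicted) label pairs. */
final class ConfusionMatrix {
    // counts.get(gold).get(predicted) = number of test items with that pair
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    // Labels in the order they were first seen, used for rows and columns
    private final List<String> labels = new ArrayList<>();

    private void register(String label) {
        if (!counts.containsKey(label)) {
            counts.put(label, new LinkedHashMap<>());
            labels.add(label);
        }
    }

    /** Records one classified test item. */
    void add(String gold, String predicted) {
        register(gold);
        register(predicted);
        counts.get(gold).merge(predicted, 1, Integer::sum);
    }

    /** Returns how many items with this gold label got this prediction. */
    int count(String gold, String predicted) {
        Map<String, Integer> row = counts.get(gold);
        return row == null ? 0 : row.getOrDefault(predicted, 0);
    }

    /** Renders the matrix: rows are gold labels, columns are predictions. */
    String render() {
        StringBuilder sb = new StringBuilder(String.format("%-12s", "gold\\pred"));
        for (String p : labels) {
            sb.append(String.format("%12s", p));
        }
        sb.append('\n');
        for (String g : labels) {
            sb.append(String.format("%-12s", g));
            for (String p : labels) {
                sb.append(String.format("%12d", count(g, p)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ConfusionMatrix cm = new ConfusionMatrix();
        // Made-up pairs for illustration; in practice these would come from
        // running the trained classifier over each line of the test file.
        cm.add("spam", "spam");
        cm.add("spam", "ham");
        cm.add("ham", "ham");
        System.out.print(cm.render());
    }
}
```

Per-class TP, FP and FN can then be read off the matrix: the diagonal cell is TP, the rest of the column is FP, and the rest of the row is FN.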

The wiki has an example of how to use numerical features with the ColumnDataClassifier. If you use numerical features, take a look at these options from the API, which let you apply some transformations:

| Property | Type | Default | Description | Feature name |
|---|---|---|---|---|
| realValued | boolean | false | Treat this column as real-valued and do not perform any transforms on the feature value. | Value |
| logTransform | boolean | false | Treat this column as real-valued and use the log of the value as the feature value. | Log |
| logitTransform | boolean | false | Treat this column as real-valued and use the logit of the value as the feature value. | Logit |
| sqrtTransform | boolean | false | Treat this column as real-valued and use the square root of the value as the feature value. | Sqrt |
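These options are set per column, prefixed with the column number (for example `1.realValued=true`). The sketch below shows one way precomputed vectors such as TF-IDF values might be prepared for the classifier, under these assumptions: the label sits in column 0, each feature value occupies its own tab-separated column, and the resulting `Properties` object is handed to ColumnDataClassifier yourself. The class `RealValuedFeatureConfig` and its methods are hypothetical helpers, not part of the library.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical helper for feeding precomputed numeric features. */
final class RealValuedFeatureConfig {

    /** Declares columns 1..numFeatureColumns as real-valued features. */
    static Properties buildProperties(int numFeatureColumns) {
        Properties props = new Properties();
        // Assumption: the class label is stored in the first column.
        props.setProperty("goldAnswerColumn", "0");
        for (int col = 1; col <= numFeatureColumns; col++) {
            // Swap in logTransform, logitTransform or sqrtTransform
            // if a transformed value is wanted instead of the raw one.
            props.setProperty(col + ".realValued", "true");
        }
        return props;
    }

    /** Formats one data line: label, then one tab-separated value per column. */
    static String toLine(String label, double[] features) {
        StringBuilder sb = new StringBuilder(label);
        for (double v : features) {
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            sb.append('\t').append(String.format(Locale.ROOT, "%.6f", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up label and values for illustration only.
        double[] tfidf = {0.5, 1.25, 0.0};
        System.out.println(toLine("sports", tfidf));
        buildProperties(tfidf.length).list(System.out);
    }
}
```

Writing such lines to the training and test files, and passing the properties to the classifier, would let it consume the vectors directly. With a large vocabulary this produces one column per term, so it may be worth checking whether that layout is practical for your data.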

stanford-nlp

text-classification