I got different results when I retrained the sentiment model with Stanford CoreNLP to compare with the related paper's results

I downloaded stanford-corenlp-full-2015-12-09 and trained a model with the following command:

 java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

After training finished, I found many model files in the directory.

Then I used the evaluation tool included in the package, running it as follows:

java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt

test.txt comes from trainDevTestTrees_PTB.zip. This is the output of that command:

F:\trainDevTestTrees_PTB\trees>java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt
EVALUATION SUMMARY
Tested 82600 labels
65331 correct
17269 incorrect
0.790932 accuracy
Tested 2210 roots
890 correct
1320 incorrect
0.402715 accuracy
Label confusion matrix
  Guess/Gold       0       1       2       3       4    Marg. (Guess)
           0     551     340      87      32       6    1016
           1     956    5348    2476     686     191    9657
           2     354    2812   51386    3097     467   58116
           3     146     744    2525    6804    1885   12104
           4       1      11      74     379    1242    1707
Marg. (Gold)    2008    9255   56548   10998    3791

           0        prec=0.54232, recall=0.2744, spec=0.99423, f1=0.36442
           1        prec=0.5538, recall=0.57785, spec=0.94125, f1=0.56557
           2        prec=0.8842, recall=0.90871, spec=0.74167, f1=0.89629
           3        prec=0.56213, recall=0.61866, spec=0.92598, f1=0.58904
           4        prec=0.72759, recall=0.32762, spec=0.9941, f1=0.4518

Root label confusion matrix
  Guess/Gold       0       1       2       3       4    Marg. (Guess)
           0      50      60      12       9       3     134
           1     161     370     147      94      36     808
           2      31     103     102      60      32     328
           3      36      97     123     305     265     826
           4       1       3       5      42      63     114
Marg. (Gold)     279     633     389     510     399

           0        prec=0.37313, recall=0.17921, spec=0.9565, f1=0.24213
           1        prec=0.45792, recall=0.58452, spec=0.72226, f1=0.51353
           2        prec=0.31098, recall=0.26221, spec=0.87589, f1=0.28452
           3        prec=0.36925, recall=0.59804, spec=0.69353, f1=0.45659
           4        prec=0.55263, recall=0.15789, spec=0.97184, f1=0.24561

Approximate Negative label accuracy: 0.638817
Approximate Positive label accuracy: 0.697140
Combined approximate label accuracy: 0.671925
Approximate Negative root label accuracy: 0.702851
Approximate Positive root label accuracy: 0.742574
Combined approximate root label accuracy: 0.722680
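To make the root-level numbers easier to read, here is a minimal Python sketch that recomputes them from the root label confusion matrix printed above. It assumes, based only on the figures in this output and not on the source of `Evaluate`, that the "approximate" negative/positive accuracies collapse labels 0 and 1 into negative and labels 3 and 4 into positive, and ignore sentences whose gold label is neutral (2). The function names are illustrative and not part of CoreNLP.

```python
# Root label confusion matrix copied from the Evaluate output above.
# Rows are guessed labels, columns are gold labels
# (0 = very negative, 2 = neutral, 4 = very positive).
ROOT_CONFUSION = [
    [50, 60, 12, 9, 3],
    [161, 370, 147, 94, 36],
    [31, 103, 102, 60, 32],
    [36, 97, 123, 305, 265],
    [1, 3, 5, 42, 63],
]


def fine_grained_accuracy(matrix):
    """Share of items whose guessed label equals the gold label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, label):
    """Precision and recall of one label (rows = guess, columns = gold)."""
    hit = matrix[label][label]
    guessed = sum(matrix[label])
    gold = sum(row[label] for row in matrix)
    return hit / guessed, hit / gold


def polarity_counts(matrix, labels):
    """Assumed definition: among gold items in `labels`, count guesses in `labels`."""
    hits = sum(matrix[guess][gold] for guess in labels for gold in labels)
    gold_total = sum(row[gold] for row in matrix for gold in labels)
    return hits, gold_total


NEGATIVE = (0, 1)
POSITIVE = (3, 4)

neg_hits, neg_total = polarity_counts(ROOT_CONFUSION, NEGATIVE)
pos_hits, pos_total = polarity_counts(ROOT_CONFUSION, POSITIVE)

print(f"fine-grained root accuracy: {fine_grained_accuracy(ROOT_CONFUSION):.6f}")
print(f"approx. negative root accuracy: {neg_hits / neg_total:.6f}")
print(f"approx. positive root accuracy: {pos_hits / pos_total:.6f}")
combined = (neg_hits + pos_hits) / (neg_total + pos_total)
print(f"combined approx. root accuracy: {combined:.6f}")
```

Under that assumption the script reproduces the reported 0.402715, 0.702851, 0.742574 and 0.722680. The positive/negative figures here are therefore derived from a five-class model's predictions, which is why the output labels them "approximate".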

My fine-grained and positive/negative accuracies differ considerably from those in the paper "Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C., 2013, October. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1631-1642)." The paper reports higher fine-grained and positive/negative accuracy than I obtained.

Is there anything wrong with what I did? Why do my results differ from the paper's?

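The short answer is that the paper used a different system, written in Matlab. The Java system does not match the paper. We do, however, distribute the binary model we trained in Matlab with the English models jar. So you can run the binary model with Stanford CoreNLP, but at this time you cannot train a binary model with similar performance using Stanford CoreNLP.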