Lazy parsing with Stanford CoreNLP to get sentiment only of specific sentences

Question

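I am looking for ways to optimize the performance of my Stanford CoreNLP sentiment pipeline. I want to get the sentiment of sentences, but only of those that contain specific keywords given as input.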

I have tried two approaches:

Approach 1: StanfordCoreNLP pipeline annotating the entire text with sentiment

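I defined a pipeline of annotators: tokenize, ssplit, parse, sentiment. I run it on the entire article, then look for the keywords in each sentence and, if they are present, run a method that returns the sentiment value. I am not satisfied, though, because processing takes a couple of seconds.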

This is the code:

import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
// sentence index -> predicted sentiment class
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    // every sentence has already been parsed; only the lookup is filtered
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}
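The `sentenceContainsKeywords` helper is not shown above. A minimal sketch of what it might look like, assuming a case-insensitive substring match on the plain sentence text; the class name `KeywordFilter` and the `String` parameter are illustrative assumptions, not the code actually used:

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the original sentenceContainsKeywords is never shown,
// so this is only one plausible implementation working on plain sentence text.
final class KeywordFilter {
    private KeywordFilter() {}

    /** Returns true if the sentence contains at least one non-empty keyword, ignoring case. */
    static boolean sentenceContainsKeywords(String sentenceText, List<String> keywords) {
        String haystack = sentenceText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            // Skip empty keywords, which would otherwise match every sentence.
            if (!keyword.isEmpty() && haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

With a CoreNLP sentence, the text to pass in could be read with `sentence.get(CoreAnnotations.TextAnnotation.class)`. A substring match also fires inside longer words, so matching on tokens may be preferable if that matters.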

Approach 2: StanfordCoreNLP pipeline annotating the entire text with sentences, separate annotators running on the sentences of interest

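Because of the poor performance of the first solution, I defined a second one. I defined a pipeline with only the annotators tokenize and ssplit. I look for the keywords in each sentence and, if they are present, create an annotation for just that sentence and run the next annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.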

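The results were really unsatisfactory because of ParserAnnotator, even though I initialized it with the same properties. Sometimes it took even more time than running the entire pipeline on the document in Approach 1.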

import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
// sentence index -> predicted sentiment class
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initiation of annotators to be run on sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // code required to perform annotation on one sentence
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}

Questions

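I wonder why parsing in CoreNLP is not "lazy"? (In my example, that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?
How come parsing one sentence can take almost as long as parsing the entire article (my article had 7 sentences)? Is it possible to configure the parser so that it works faster?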

Answer 1

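If you're looking to speed up constituency parsing, the best improvement is to use the new shift-reduce constituency parser. It is orders of magnitude faster than the default PCFG parser.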
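As a rough sketch of what switching parsers could look like in the question's setup: the properties below select the shift-reduce model through `parse.model` and add the `pos` annotator, which the shift-reduce parser needs. The class and method names are made up for illustration, and the model path is an assumption: it presumes the English shift-reduce model is on the classpath under that name, so check it against your CoreNLP version.

```java
import java.util.Properties;

// Illustrative configuration holder; not part of CoreNLP itself.
final class ShiftReduceConfig {
    private ShiftReduceConfig() {}

    /** Builds pipeline properties that use the shift-reduce parser for sentiment. */
    static Properties sentimentProps() {
        Properties props = new Properties();
        // The shift-reduce parser relies on part-of-speech tags, so "pos" runs before "parse".
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        // Assumed model location; the shift-reduce models must be available on the classpath.
        props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        // Carried over unchanged from the question.
        props.setProperty("tokenize.options", "untokenizable=noneDelete");
        return props;
    }
}
```

The result would be passed to the pipeline as before, for example `new StanfordCoreNLP(ShiftReduceConfig.sentimentProps())`.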

Answers to your later questions:

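Why is CoreNLP parsing not lazy? This is certainly possible, but it is not something we have implemented in the pipeline yet. We likely haven't seen many use cases in-house where it is necessary. We will happily accept a contribution of a "lazy annotator wrapper" if you're interested in making one!
Why does parsing one sentence take almost as long as parsing an entire article? The default Stanford PCFG parser has cubic time complexity with respect to sentence length, so a single long sentence can dominate the total runtime. This is why we usually recommend restricting the maximum sentence length for performance reasons. The shift-reduce parser, on the other hand, runs in time linear in the sentence length.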
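To make the "lazy" idea concrete, here is a minimal sketch of a memoizing wrapper that postpones an expensive computation until its result is first requested. It is not a CoreNLP API and not the "lazy annotator wrapper" itself; `LazyValue` is a hypothetical name, and the supplier stands in for whatever parse-and-score step would be run for a sentence.

```java
import java.util.function.Supplier;

// Hypothetical helper: defers a computation until get() is first called,
// then caches the result so later calls do not recompute it.
final class LazyValue<T> implements Supplier<T> {
    private Supplier<T> compute;
    private T value;
    private boolean computed;

    LazyValue(Supplier<T> compute) {
        this.compute = compute;
    }

    @Override
    public synchronized T get() {
        if (!computed) {
            value = compute.get();
            computed = true;
            compute = null; // drop the reference so captured state can be collected
        }
        return value;
    }
}
```

One `LazyValue<Integer>` could be created per sentence, with `get()` called only for the sentences that contain a keyword, so the parser would never run for the others.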

java

performance

parsing

stanford-nlp

sentiment-analysis