Stanford CoreNLP returning wrong results

Question

I am trying to do lemmatization with Stanford CoreNLP, following this question. My environment is:

Java 1.7
Eclipse 3.4.0
Stanford CoreNLP version 3.4.1 (downloaded from here).

My code snippet is:

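    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    //...........lemmatization starts........................

    // The lemma annotator depends on tokenize, ssplit and pos.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
    String text = "painting";
    Annotation document = pipeline.process(text);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            String word = token.get(TextAnnotation.class);   // surface form of the token
            String lemma = token.get(LemmaAnnotation.class); // lemma chosen for that token
            System.out.println("lemmatized version :" + lemma);
        }
    }

    //...........lemmatization ends.........................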

The output I get is:

lemmatized version :painting

whereas I expect

lemmatized version :paint

Please advise.

Answer 1

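The issue in this example is that the word painting can be either the present participle of to paint or a noun, and the output of the lemmatizer depends on the part-of-speech tag assigned to the original word.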
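To make the dependence on the tag concrete, here is a toy sketch. It is not how CoreNLP's lemmatizer is implemented: the class ToyLemmatizer and its single suffix rule are hypothetical and exist only to show that the same word can map to different lemmas under different tags.

```java
public class ToyLemmatizer {

    // Hypothetical rule, for illustration only: strip a trailing "ing" from
    // words whose tag starts with "VB" (verb tags); return every other word
    // lowercased but otherwise unchanged. Real lemmatizers use far richer rules.
    static String lemma(String word, String tag) {
        String lower = word.toLowerCase();
        if (tag.startsWith("VB") && lower.endsWith("ing") && lower.length() > 3) {
            return lower.substring(0, lower.length() - 3);
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(lemma("painting", "NN"));   // noun reading: prints "painting"
        System.out.println(lemma("painting", "VBG"));  // verb reading: prints "paint"
    }
}
```

The point is only that the lemma is a function of the pair (word, tag), so a different tag can give a different lemma for the same surface form.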

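If you run the tagger on the fragment painting alone, there is no context to help the tagger (or a human) decide how the word should be tagged. In this case it picked the tag NN, and the lemma of the noun painting is indeed painting.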

If you run the same code on the sentence "I am painting a flower.", the tagger should correctly tag painting as VBG, and the lemmatizer should return paint.
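One way to check this yourself is to print the part-of-speech tag next to each lemma. The following is a minimal sketch, assuming the CoreNLP jar and its models jar are on the classpath; the class name LemmaWithTags and its helper methods are made up for this example, while the annotation classes are the same ones used in the question plus PartOfSpeechAnnotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class LemmaWithTags {

    // Builds a pipeline with the same annotators as in the question.
    static StanfordCoreNLP buildPipeline() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        return new StanfordCoreNLP(props);
    }

    // Returns one "word/TAG/lemma" string per token of the input text.
    static List<String> describe(StanfordCoreNLP pipeline, String text) {
        List<String> out = new ArrayList<String>();
        Annotation document = pipeline.process(text);
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String tag = token.get(PartOfSpeechAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                out.add(word + "/" + tag + "/" + lemma);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        StanfordCoreNLP pipeline = buildPipeline();
        // The bare word gives the tagger no context; the sentence does.
        System.out.println(describe(pipeline, "painting"));
        System.out.println(describe(pipeline, "I am painting a flower."));
    }
}
```

Comparing the tag printed for painting in the two inputs shows whether the tagger treated it as a noun or as a verb form, which in turn explains the lemma you get.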

eclipse-3.4

lemmatization

stanford-nlp

java-7