How to write scripts to keep punctuation in Stanford dependency parser

Question

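To get some specific dependency information, I wrote a Java program to parse sentences rather than directly using the ParserDemo.java that ships with Stanford Parser 3.9.1. However, I found that punctuation is missing once I retrieve the typedDependencies. Does Stanford Parser have a function for getting the punctuation? I had to write my own program to parse sentences because I need to create a SemanticGraph from the list of TypedDependencies, so that I can use the methods in SemanticGraph to get information about every single token (including punctuation).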

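import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TypedDependency;
import edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure;

public class ChineseFileTest3 {

    public static void main(String[] args) {
        // Load the Chinese parsing model from the classpath.
        String modelpath = "edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz";
        LexicalizedParser lp = LexicalizedParser.loadModel(modelpath);
        String textFile = "data/chinese-onesent-unseg-utf8.txt";
        demoDP(lp, textFile);
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // Parse each sentence in the file and print its typed dependencies.
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree t = lp.apply(sentence);

            ChineseGrammaticalStructure gs = new ChineseGrammaticalStructure(t);
            Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
            // The punctuation tokens are missing from this output.
            System.out.println(tdl);
        }
    }
}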

Answer 1

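I would advise against using the parser standalone; run a pipeline instead. That will keep the punctuation.

There is comprehensive documentation on using the pipeline through the Java API here: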

https://stanfordnlp.github.io/CoreNLP/api.html

You need to set the properties for Chinese. A quick way to do that is with this line of code:

Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-chinese.properties");
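As a minimal sketch of how that line might fit into a complete program: the class name `ChinesePipelineSketch`, the helper `chineseProps`, the sample sentence, and the narrowed annotator list are all assumptions made for illustration, not part of the original answer. The sketch prints every token of each sentence and then the basic dependency graph, which the pipeline returns as a SemanticGraph, so there should be no need to build one from a TypedDependency list.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.StringUtils;

public class ChinesePipelineSketch {

    // Load the Chinese defaults, then narrow the annotators to what parsing needs.
    // Narrowing is an assumption made here to keep start-up light; the defaults
    // from the properties file can be left untouched instead.
    static Properties chineseProps() {
        Properties props = StringUtils.argsToProperties(
                "-props", "StanfordCoreNLP-chinese.properties");
        props.setProperty("annotators", "tokenize,ssplit,pos,parse");
        return props;
    }

    public static void main(String[] args) {
        StanfordCoreNLP pipeline = new StanfordCoreNLP(chineseProps());

        // Hypothetical sample sentence; the pipeline segments raw Chinese text itself.
        Annotation document = new Annotation("我喜欢自然语言处理。");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The token list covers every token of the sentence, punctuation included.
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.index() + "\t" + token.word() + "\t" + token.tag());
            }

            // Basic dependencies produced by the parse annotator, as a SemanticGraph.
            SemanticGraph graph = sentence.get(
                    SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            System.out.println(graph.toList());
        }
    }
}
```

Running this requires both the CoreNLP jar and the Chinese models jar on the classpath, since the properties file and the models are loaded from there.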


text-processing

nlp

stanford-nlp