How to separate words and special characters using Stanford tokenize?

Question

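I'm using the Stanford CoreNLP tools and I need to split this string into tokens: "(see functional requirement number 150)."

The result of my code (as CoreLabels) is: [(see, functional, requirement, number, 150).]

when it should be: [(, see, functional, requirement, number, 150, ), .]

The code is: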

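import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public List<CoreMap> armador(String text) {

   // Configure the pipeline: tokenizer, sentence splitter, POS tagger
   Properties props = new Properties();
   StanfordCoreNLP pipeline;
   props.put("annotators", "tokenize,ssplit,pos");
   props.put("ssplit.eolonly", "true");
   props.put("tokenize.whitespace", "true");

   // Annotate the text and return the sentence annotations
   pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(text);
   pipeline.annotate(document);
   List<CoreMap> result = document.get(CoreAnnotations.SentencesAnnotation.class);

   return result;
}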

Thanks, and sorry for my English!

Answer 1

This is caused by the property:

props.put("tokenize.whitespace", "true");

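By default, CoreNLP runs Penn Treebank tokenization, which correctly splits the parentheses off as separate tokens. However, the tokenize.whitespace property forces CoreNLP to tokenize on whitespace only, so any punctuation attached to a word stays inside that word's token.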
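To make the difference concrete, here is a minimal sketch that uses only the Java standard library. It does not call CoreNLP: `punctuationAwareTokens` is a simplified regex stand-in for a punctuation-aware tokenizer, and the class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeContrast {

    // Whitespace-only splitting: punctuation stays glued to the neighbouring word.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Simplified stand-in for punctuation-aware tokenization (NOT CoreNLP's tokenizer):
    // a token is either a run of letters/digits or one single non-space symbol.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "(see functional requirement number 150).";
        System.out.println(whitespaceTokens(text));       // five tokens, brackets attached
        System.out.println(punctuationAwareTokens(text)); // eight tokens, brackets separate
    }
}
```

The first list reproduces the output from the question, while the second matches the expected one. A real treebank-style tokenizer handles many more cases (contractions, abbreviations, numbers with separators) than this regex does.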

Edit: you should probably also be wary of props.put("ssplit.eolonly", "true"); with that property set, sentences are split only at newlines.
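Putting both remarks together, the configuration could be built roughly as follows. This is a sketch under assumptions: the class `PipelineProps`, the method `build`, and the `oneSentencePerLine` flag are hypothetical names for illustration, and only the property keys come from the question. The block uses just `java.util.Properties`, so it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

public class PipelineProps {

    // Build pipeline properties that leave the tokenizer at its default behaviour.
    // oneSentencePerLine is a hypothetical flag introduced for this sketch.
    static Properties build(boolean oneSentencePerLine) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        // tokenize.whitespace is deliberately not set, so the default tokenizer is used.
        if (oneSentencePerLine) {
            // Split sentences at newlines only when the input really has one sentence per line.
            props.setProperty("ssplit.eolonly", "true");
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(false);
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.containsKey("tokenize.whitespace"));
    }
}
```

The returned object can then be passed to new StanfordCoreNLP(props) in the same way as in the question's code.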

java

tokenize

stanford-nlp