Using ssplit options for CoreNLP

Question

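According to the documentation, I can use options such as ssplit.isOneSentence to control how my document is split into sentences. Given a StanfordCoreNLP object, how exactly do I do this?

Here is my code: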

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(doc);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

When and where should I add this option? Something like this?

pipeline.ssplit.boundaryTokenRegex = '"'

I would also like to know how to do this for the specific option boundaryTokenRegex.

Edit:

I think this is closer to what I need:

props.put("ssplit.boundaryTokenRegex", "\"");

I still need to verify it, though.

Answer 1

To make every occurrence of ' " ' end a sentence, set this property before constructing the pipeline:

props.setProperty("ssplit.boundaryMultiTokenRegex", "/\'\'/");

or

props.setProperty("ssplit.boundaryMultiTokenRegex", "/\"/");

Which one applies depends on how the quote is stored in the token. (CoreNLP normalizes it to the former.)
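As a rough sketch of where the option goes: it is an ordinary entry in the same Properties object that holds "annotators", set before the pipeline is constructed. The class name SsplitProps and the helper buildProps below are made up for illustration. The snippet uses only java.util, so it runs without CoreNLP, and the pipeline construction is shown as a comment. It also prints what the two escaped literals above look like once Java has unescaped them.

```java
import java.util.Properties;

class SsplitProps {
    // Hypothetical helper: gathers the pipeline settings in one place.
    static Properties buildProps(String boundaryPattern) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        // ssplit options are plain properties, set before the pipeline exists.
        props.setProperty("ssplit.boundaryMultiTokenRegex", boundaryPattern);
        return props;
    }

    public static void main(String[] args) {
        // The two literals from the answer, as Java stores them after unescaping.
        String twoSingleQuotes = "/\'\'/";
        String doubleQuote = "/\"/";
        System.out.println(twoSingleQuotes);
        System.out.println(doubleQuote);

        Properties props = buildProps(twoSingleQuotes);
        System.out.println(props.getProperty("ssplit.boundaryMultiTokenRegex"));
        // With CoreNLP on the classpath, the next step would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Changing the property after the pipeline has been built would presumably have no effect, since the pipeline is configured from the Properties passed to its constructor.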

If you want both the opening and the closing quotes:

props.setProperty("ssplit.boundaryMultiTokenRegex", "/''|``/");
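To see what a boundary pattern that accepts either `` or '' does to a token stream, here is a simplified stand-in written with plain java.util.regex. It is not CoreNLP's implementation: the class QuoteBoundaryDemo and the method splitAfter are invented for this sketch, the regex is the part between the slashes of the pattern, and the sample tokens assume PTB-style quote normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QuoteBoundaryDemo {
    // Closes a sentence after every token whose full text matches boundaryRegex.
    // Illustration only; the real ssplit annotator does considerably more.
    static List<List<String>> splitAfter(List<String> tokens, String boundaryRegex) {
        Pattern boundary = Pattern.compile(boundaryRegex);
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (boundary.matcher(token).matches()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep any trailing tokens that were not followed by a boundary.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Assumed tokenizer output with opening quotes as `` and closing as ''.
        List<String> tokens = Arrays.asList("``", "Hi", "''", "she", "said");
        System.out.println(splitAfter(tokens, "''|``"));
    }
}
```

With these sample tokens the opening quote ends up as a one-token sentence of its own, which is worth keeping in mind before treating both quote forms as boundaries.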

tokenize

stanford-nlp