Text tokenization with Stanford NLP: Filter unrequired words and characters

Question

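I use Stanford NLP for string tokenization in my classification tool. I only want meaningful words, but I also get non-word tokens (such as ---, >, .) and unimportant words such as am, is, to (stop words). Does anybody know a way to solve this problem?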

Answer 1

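This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.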

Here is an example list of English stopwords.
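As a rough illustration of the two filters described above, here is a minimal sketch in plain Java. It assumes the tokens have already been produced by the CoreNLP tokenizer and extracted as strings (for example via CoreLabel.word()). The class name TokenFilter, the tiny stop list, and the "contains at least one letter or digit" rule are all assumptions chosen for illustration, not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: filters already-tokenized text; not part of CoreNLP.
class TokenFilter {
    // Deliberately tiny stop list for illustration; a real one would be loaded from a file.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("am", "is", "to", "the", "a", "an"));

    // A token counts as a "word" here if it contains at least one letter or digit.
    private static final Pattern WORD_CHAR = Pattern.compile("[\\p{L}\\p{N}]");

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean isWord = WORD_CHAR.matcher(token).find();
            boolean isStopword = STOPWORDS.contains(token.toLowerCase(Locale.ROOT));
            if (isWord && !isStopword) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("I", "am", "going", "to", "---", "Paris", ">", ".");
        System.out.println(filter(tokens)); // prints [I, going, Paris]
    }
}
```

The regex check drops tokens such as ---, > and . because they contain no letter or digit, and the lower-cased set lookup drops the stop words. Whether tokens like "3.14" or "C++" should survive depends on the domain, so the pattern would likely need adjusting.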

Answer 2

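For Stanford CoreNLP, there is a stopword removal annotator (a custom annotator, not one of the built-in ones) that provides the functionality to remove standard stopwords. You can also define custom stopwords there as needed (e.g. ---, <, .).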

You can see an example here:

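   import java.util.List;
   import java.util.Properties;
   import edu.stanford.nlp.ling.CoreAnnotations;
   import edu.stanford.nlp.ling.CoreLabel;
   import edu.stanford.nlp.pipeline.Annotation;
   import edu.stanford.nlp.pipeline.StanfordCoreNLP;

   Properties props = new Properties();
   props.setProperty("annotators", "tokenize, ssplit, stopword");
   // Register the custom annotator class under the name "stopword"
   props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");

   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(example);  // example: the input text, assumed to be a String
   pipeline.annotate(document);
   List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);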

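In the example above, "tokenize, ssplit, stopword" is the list of annotators the pipeline runs, and the customAnnotatorClass.stopword property registers the custom stopword annotator under the name stopword.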

Hope this helps!

java

machine-learning

tokenize

stanford-nlp