How to index only words with a minimum length using Apache Lucene 5.3.1?

Question

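Can someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?

I've searched through the API but didn't find anything that suits my needs except this, and I couldn't figure out how to use it.

Thanks!

Edit: I guess this is important information, so here is a copy of my explanation, taken from a reply below, of what I want to achieve: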

"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results, as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse one would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way than one of those."

Answer 1

You should use a custom analyzer with a LengthFilter (registered under the name "length"). For example:

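import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// The builder methods and build() can throw IOException.
Analyzer ana = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("standard")
                .addTokenFilter("lowercase")
                // Keep only tokens that are 4 to 50 characters long.
                .addTokenFilter("length", "min", "4", "max", "50")
                // Drop the words listed in stopwords.txt (one word per line).
                .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
                .build();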
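
To check what such an analyzer actually emits before building the doc-term matrix, the tokens can be collected and printed directly. The following is a minimal sketch, assuming Lucene 5.x with the analyzers-common module on the classpath. The class name MinLengthDemo, the helpers buildAnalyzer and analyze, the field name "content", and the min value of 3 (chosen to drop one- and two-character words, as the question asks) are illustrative assumptions, not part of the original answer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MinLengthDemo {

    // Hypothetical helper: standard tokenizer, lowercasing, then a length filter.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                // Assumed bounds: keep tokens of 3 to 50 characters.
                .addTokenFilter("length", "min", "3", "max", "50")
                .build();
    }

    // Hypothetical helper: runs the analyzer over the text and collects the surviving terms.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // "content" is an arbitrary field name used only for this demo.
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = buildAnalyzer();
        System.out.println(analyze(analyzer, "int i = 0; for (String name : names) {}"));
    }
}
```

Passing the same analyzer to new IndexWriterConfig(analyzer) should then keep the short tokens out of the index itself, so the results need no filtering afterwards.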

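However, it is better to use a list of stop words (words that occur in almost every document, such as the articles in English). This gives more accurate results.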
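
The answer does not say how to obtain such a list. One possible approach, offered here only as a sketch, is to take the definition literally and collect the terms that occur in at least a given fraction of the documents. The code below is plain Java and does not use Lucene. The class name StopListSketch, the method frequentTerms, the threshold of 0.9, and the sample documents are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StopListSketch {

    // Returns the terms that occur in at least minFraction of the documents.
    static Set<String> frequentTerms(List<List<String>> docs, double minFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() >= minFraction * docs.size()) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "parser", "reads", "the", "file"),
                Arrays.asList("the", "index", "stores", "terms"),
                Arrays.asList("the", "writer", "flushes", "the", "index"));
        // "the" occurs in 3 of 3 documents, "index" in only 2 of 3.
        System.out.println(frequentTerms(docs, 0.9)); // prints [the]
    }
}
```

The resulting words could be written one per line to the stopwords.txt file that the stop filter in the answer reads.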


apache

lucene

nlp

minimum

word