How to index only words with a minimum length using Apache Lucene 5.3.1?

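Can someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?

I've searched through the API but didn't find anything that suits my needs except this, and I couldn't figure out how to use it.

Thanks!

Edit: I guess this is important information, so here is a copy of my explanation, from the reply below, of what I want to achieve: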

"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results, as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse option would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way than one of those."

You should use a custom analyzer with a LengthFilter (registered under the name "length"). For example:

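import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// build() throws IOException; stopwords.txt is loaded from the classpath by default
Analyzer ana = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("length", "min", "4", "max", "50")  // keep only tokens of 4 to 50 characters
                .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
                .build();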

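But it is better to also use a stop word list (words that occur in almost all documents, such as the articles in English). This gives more accurate results.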
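To make the effect of the "min" and "max" bounds concrete, here is a minimal plain-Java sketch of the same idea. It is a stand-in for illustration only and does not use Lucene: the class name `LengthFilterSketch`, the helper `keepByLength`, and the simplified `tokenize` method are all hypothetical. It assumes both bounds are inclusive, so dropping only single- and double-character words, as the question asks, would need a minimum of 3 rather than 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java stand-in for a tokenize -> lowercase -> length-filter chain; not Lucene code.
class LengthFilterSketch {

    // Keeps tokens whose length lies in [min, max], both bounds inclusive.
    static List<String> keepByLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Hypothetical, simplified tokenizer: split on runs of non-alphanumeric
    // characters and lowercase each piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("int i = db.getConnectionPool(id);");
        // With min = 3, the short tokens "i", "db" and "id" are dropped.
        System.out.println(keepByLength(tokens, 3, 50));
    }
}
```

Filtering at analysis time like this keeps short tokens out of the index altogether, so they never reach the doc-term matrix and no post-processing of the results is needed.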