Using default and custom stop words with Apache's Lucene (weird output)

Question

I am removing stop words from a string using Apache's Lucene (8.6.3) and the following Java 8 code:

private static final String CONTENTS = "contents";
final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short","test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}

This outputs the desired result:

[this] [is] [a] [bla]

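Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to GitHub), and the custom stop set above (the one I actually want to use is much longer), so I tried this: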

Analyzer analyzer = new EnglishAnalyzer(stopSet);

The output is:

[thi] [is] [a] [bla]

Yes, the "s" in "this" is missing. What causes this? It also didn't use the default stop set.

The following changes remove both the default and the custom stop words:

Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);

Question: What is the "proper" way to do this? Will using the tokenStream within itself (see the code above) cause problems?

Bonus question: How can I output the remaining words with the correct upper/lower case, i.e. as they were written in the original text?

Answer 1

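I will tackle this in two parts:

stop words
preserving original case

Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list plus your own custom list, you can create a merged list as follows: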

import org.apache.lucene.analysis.en.EnglishAnalyzer;

...

final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);

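The above code simply takes the English stop words bundled with Lucene and merges them into your list.

Passing the merged stopSet to the StandardAnalyzer from your question gives the following output: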

[bla]
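
For reference, here is a minimal end-to-end sketch that puts the merged stop set together with the token loop from the question. It assumes Lucene 8.x with lucene-core and lucene-analyzers-common on the classpath; the class name MergedStopWordsDemo and the helper method tokens are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of Lucene.
public class MergedStopWordsDemo {

    // The field name is only a label here; nothing is indexed.
    private static final String CONTENTS = "contents";

    /** Tokenizes text with StandardAnalyzer, dropping custom plus default English stop words. */
    static List<String> tokens(String text, List<String> customStopWords) {
        // Start with the custom words (ignoreCase = true), then add Lucene's bundled English set.
        CharArraySet stopSet = new CharArraySet(customStopWords, true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

        List<String> result = new ArrayList<>();
        // try-with-resources closes the stream first, then the analyzer.
        try (Analyzer analyzer = new StandardAnalyzer(stopSet);
             TokenStream stream = analyzer.tokenStream(CONTENTS, text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the remaining tokens as a list.
        System.out.println(tokens("This is a short test! Bla!", Arrays.asList("short", "test")));
    }
}
```

StandardAnalyzer still lower-cases every token, so this sketch only covers the stop word part of the question.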

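Handling Word Case

This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case, so we can't use it.

Also, if you want to maintain your own custom list of stop words, and that list is of any size, I would recommend storing it in its own text file rather than embedding the list in your code.

So, let's assume you have a file called stopwords.txt. The file has one word per line, and it already contains the merged list of your custom stop words and the official English stop words.

You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).

My test file looks like this: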

short
this
is
a
test
the
him
it

I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

import org.apache.lucene.analysis.custom.CustomAnalyzer;

...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

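This does the following:

It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer (in Lucene 8.x this lives in the separate lucene-analyzers-icu module), which takes care of tokenizing on Unicode whitespace and handling punctuation.
It applies the stop word list. Note the use of true for the ignoreCase attribute, and the reference to the stop word file. The wordset format means "one word per line" (there are other formats as well).

The key point is that nothing in the above chain changes word case.

So, now, using this new analyzer, the output is as follows: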

[Bla]
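
If you would rather not depend on a stop word file or on the ICU module, the same idea (tokenize, filter stop words case-insensitively, never lower-case) can be sketched by building the chain by hand. This is an alternative illustration, not the CustomAnalyzer approach shown above: it assumes Lucene 8.x with lucene-core on the classpath, swaps in StandardTokenizer for the ICU tokenizer, and uses an in-memory stop set. The class name CasePreservingStopDemo and its methods are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of Lucene.
public class CasePreservingStopDemo {

    /** Builds an analyzer whose chain is tokenizer -> stop filter, with no lower-casing step. */
    static Analyzer newAnalyzer(final CharArraySet stopSet) {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream filtered = new StopFilter(source, stopSet);
                return new TokenStreamComponents(source, filtered);
            }
        };
    }

    /** Returns the tokens of text that are not stop words, in their original case. */
    static List<String> tokens(String text, List<String> stopWords) {
        // ignoreCase = true: "This" matches the stop word "this"; kept tokens are not altered.
        CharArraySet stopSet = new CharArraySet(stopWords, true);
        List<String> result = new ArrayList<>();
        try (Analyzer analyzer = newAnalyzer(stopSet);
             TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same words as the test stopwords.txt file above.
        List<String> stopWords = Arrays.asList("short", "this", "is", "a", "test", "the", "him", "it");
        System.out.println(tokens("This is a short test! Bla!", stopWords));
    }
}
```

Defining the filter chain inside createComponents keeps the whole pipeline in one place, which is the usual alternative to wrapping the stream returned by analyzer.tokenStream(...) in another filter afterwards.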

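Final Notes

Where do you put the stop list file? By default, Lucene expects to find it on your application's classpath. So, for example, you can put it in the default package.

But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (and is not left behind with the source code).

I mostly use Maven, so I have this in my POM to make sure the ".txt" file gets deployed as needed: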

    <build>  
        <resources>  
            <resource>  
                <directory>src/main/java</directory>  
                <excludes>  
                    <exclude>**/*.java</exclude>  
                </excludes>  
            </resource>  
        </resources>  
    </build>

This tells Maven to copy files (except for Java source files) to the build target, which ensures the text file gets copied.

One last note: I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
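
A likely explanation, which was not verified as part of this answer: EnglishAnalyzer's filter chain ends with a Porter stemmer (PorterStemFilter), and the EnglishAnalyzer(stopSet) constructor uses the supplied set in place of the default English stop words rather than in addition to them. The sketch below is a simplified plain-Java re-implementation of only step 1a (plural removal) of the published Porter algorithm, to show how "this" can become "thi"; it is not Lucene's code, and the class name PorterStep1aSketch is made up for this example.

```java
// Hypothetical demo class: a simplified version of Porter step 1a only.
public class PorterStep1aSketch {

    /** Applies step 1a (plural removal) of the Porter algorithm to a lower-case word. */
    static String step1a(String word) {
        if (word.length() <= 2) {
            return word; // one- and two-letter words are left unchanged
        }
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word; // ss -> ss
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        // The lower-cased tokens that survive the custom stop set in the question.
        for (String w : new String[] {"this", "is", "a", "bla"}) {
            System.out.print("[" + step1a(w) + "] ");
        }
        // prints: [thi] [is] [a] [bla]
    }
}
```

If that explanation holds, any analyzer without a stemming step, such as the StandardAnalyzer with the merged stop set from part 1, avoids the truncation.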

Follow-Up Questions

After combining I have to use the StandardAnalyzer, right?

Yes, that's right. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use there.

I want to keep the stop word file on a specific non-imported path - how to do that?

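You can tell the CustomAnalyzer to look for the stop words file in a "resources" directory. That directory can be anywhere on the file system (for easy maintenance, as you noted):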

import java.nio.file.Path;
import java.nio.file.Paths;

...

Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

Instead of .builder(), we now use .builder(resources).
