Lucene: index a few tokens for each word in the text

Question

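I am using Lucene 3.5 with SpanishAnalyzer (which itself uses SpanishStemmer and StandardTokenizer).
When SpanishAnalyzer indexes a document containing, for example, the words "claramente" and "claro", both are indexed as "clar".
I understand this behavior and it suits my needs. Today, before querying, I run my search terms through the analyzer's tokenStream + incrementToken() to get their tokens, and I search the indexed documents with those. I don't use QueryParser; I build the Lucene query objects in code.
However, I would like to be able to search for the exact word (in this case "claro") without losing the morphological capabilities of SpanishAnalyzer.
I could skip the step above (tokenStream) and search for "claro" directly, but it would not be found because it was indexed as "clar".
Also, I don't want to index the field twice with two different analyzers, because I need to be able to use a PhraseQuery or SpanNearQuery that contains both an exact word and a regular (morphological) term.
So, to get to the point: I would like to modify the Tokenizer, Stemmer, or Filter (?) so that at index time it indexes two tokens for every word, the stemmed token and the original one (in this case "clar" and "claro"), and later at query time I can choose whether to use the exact word or the stemmed token.
I need help understanding how (and where) I can do this. I guess the change belongs somewhere in the stemmer.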

By the way, the Hebrew analyzer I use does exactly this: when using incrementToken() it returns several tokens for each word in the text (but I don't have its source code).

Answer 1

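You need an index with several tokens at each position, because you want to search for phrases that mix stemmed and non-stemmed (= original) tokens. I will answer for version 5.3, but version 3.5 is not much different.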

Have a look at the source code of ReversedWildcardFilter in Solr. You will see two steps for each token:

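Store the current state together with the original token. The first call of the incrementToken() method then returns the stemmed token, and the second call returns the original token (at the same position).
Choose a "markerChar" as a prefix for the stemmed token. At search time you can then decide whether to search with the stemmed token or with the original one.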

For your SpanishAnalyzer this means, for example, the following:

The core of SpanishAnalyzer is SpanishLightStemFilter. SpanishLightStemFilter only stems tokens for which !KeywordAttribute.isKeyword() holds. So, at index time, insert a KeywordRepeatFilter into SpanishAnalyzer and mark the stemmed tokens with a prefix.
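
To make the two steps above more concrete, here is a minimal plain-Java sketch of the token layout such a filter chain would produce. It is a toy model and does not use the Lucene API: the Token class, the toyStem method (a crude stand-in for the Spanish stemmer) and the marker character '\u0001' are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class StackedTokensSketch {

    // Hypothetical marker prepended to stemmed tokens so they can be told apart at query time.
    static final char MARKER = '\u0001';

    /** A term plus its position increment (0 means "same position as the previous token"). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            // Show the unprintable marker as '#' to keep the output readable.
            return term.replace(MARKER, '#') + "/+" + positionIncrement;
        }
    }

    /** Crude stand-in for a Spanish stemmer; NOT the real algorithm. */
    static String toyStem(String word) {
        if (word.endsWith("amente")) {
            return word.substring(0, word.length() - "amente".length());
        }
        if (word.endsWith("o") || word.endsWith("a")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Emits the original word, then its marked stem stacked on the same position. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(new Token(word, 1));                    // original token advances the position
            tokens.add(new Token(MARKER + toyStem(word), 0));  // marked stem shares that position
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("claramente claro"));
    }
}
```

Running main prints [claramente/+1, #clar/+0, claro/+1, #clar/+0]: each word occupies one position that holds both its original form and its marked stem, so a query can pick either form per word.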

Answer 2

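Since SpanishLightStemFilter respects the KeywordAttribute, there is a token filter that makes this quite easy: KeywordRepeatFilter. Just add it to your analysis chain, right before the stemmer. For SpanishAnalyzer, the createComponents method would look like this: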

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source;
    if (getVersion().onOrAfter(Version.LUCENE_4_7_0)) {
        source = new StandardTokenizer();
    } else {
        source = new StandardTokenizer40();
    }
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
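    if (!stemExclusionSet.isEmpty())
        result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    // Emit each token twice at the same position: once flagged as a keyword, once not
    result = new KeywordRepeatFilter(result);
    // The stemmer skips keyword-flagged tokens, so the original form is kept next to the stem
    result = new SpanishLightStemFilter(result);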
    return new TokenStreamComponents(source, result);
}

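This will not let you explicitly search only for unstemmed terms, because stems and originals end up in the same term space, but it keeps the original terms at the same positions as the stems, so they can easily be factored into phrase queries. If you really do need to search explicitly for only stemmed or only unstemmed terms, then indexing into separate fields is the better approach.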
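
To illustrate why keeping both forms at the same position helps phrase queries, here is a small plain-Java sketch of a positional index in which every word is stored both as typed and stemmed. It is only a toy model, not the Lucene API: the index and matchesPhrase methods and the toyStem stand-in for the Spanish stemmer are assumptions made for illustration, and no marker prefix is used, mirroring the chain above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StackedPhraseSketch {

    /** Crude stand-in for a Spanish stemmer; NOT the real algorithm. */
    static String toyStem(String word) {
        if (word.endsWith("amente")) {
            return word.substring(0, word.length() - "amente".length());
        }
        if (word.endsWith("o") || word.endsWith("a")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Maps each term to its positions; a word and its stem share one position. */
    static Map<String, Set<Integer>> index(String text) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(word, k -> new HashSet<>()).add(position);
            postings.computeIfAbsent(toyStem(word), k -> new HashSet<>()).add(position);
            position++;
        }
        return postings;
    }

    /** True if the terms occur at consecutive positions, in the given order. */
    static boolean matchesPhrase(Map<String, Set<Integer>> postings, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.getOrDefault(terms[0], Collections.<Integer>emptySet());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], Collections.<Integer>emptySet())
                        .contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> doc = index("habla claramente siempre");
        System.out.println(matchesPhrase(doc, "habl", "clar"));        // stem + stem
        System.out.println(matchesPhrase(doc, "habla", "claramente")); // exact + exact
        System.out.println(matchesPhrase(doc, "habl", "claramente"));  // stem + exact
        System.out.println(matchesPhrase(doc, "habla", "claro"));      // exact word not in text
    }
}
```

The first three phrases match and the last one does not, since the exact word "claro" never occurs in the text. The model also shows the limitation mentioned above: the term "clar" would match a literal word "clar" as well as any word that stems to it.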

java

lucene

query-analyzer