What's wrong with TermRangeQuery?

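TermRangeQuery does not behave the way I expect.
I'm new to Lucene, and also new to Java.
So I may not understand what result my code should produce, or I have made some ugly mistake.
Here is the code (you can try it at https://repl.it/@Tekener/AstonishingAridWatch):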

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

@SuppressWarnings("deprecation")
class Main {
    private static IndexSearcher indexSearcher;
    private static IndexReader indexReader;
    private static String separatorLine = "===========================";

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, config);

        System.out.println(separatorLine);
        System.out.println("Building the index:");
        indexWriter.addDocument(createDocumentWithFields("1st", "Humpty Dumpty sat on a wall,"));
        indexWriter.addDocument(createDocumentWithFields("2nd", "Humpty Dumpty had a great fall."));
        indexWriter.addDocument(createDocumentWithFields("3rd", "All the king's horses and all the king'smen"));
        indexWriter.addDocument(createDocumentWithFields("4th", "Couldn't put Humpty together again."));
        System.out.println(separatorLine);

        indexWriter.commit();
        indexWriter.close();        

        indexReader = DirectoryReader.open(directory);
        indexSearcher = new IndexSearcher(indexReader);

        showQueryResult(1, TermRangeQuery.newStringRange("content", "a", "h", true, true));
        showQueryResult(2, TermRangeQuery.newStringRange("content", "A", "H", true, true));
        showQueryResult(3, TermRangeQuery.newStringRange("content", "a", "f", true, true));
        showQueryResult(4, TermRangeQuery.newStringRange("content", "A", "F", true, true));
    }

    private static void showQueryResult(int queryNo, Query query) throws IOException {
        System.out.println(String.format("Query #%d: %s", queryNo, query.toString()));
        TopDocs topDocs = indexSearcher.search(query, 100);
        System.out.println("Result:");
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document doc = indexReader.document(scoreDoc.doc);
            System.out.println(String.format("name: %s - content: %s", doc.getField("name").stringValue(), doc.getField("content").stringValue()));
        }
        System.out.println(separatorLine);
    }

    private static Document createDocumentWithFields(String name, String content) {
        System.out.println(String.format("name: %s - content: %s", name, content));
        Document doc = new Document();
        doc.add(new StringField("name",  name,    Store.YES));
        doc.add(new TextField("content", content, Store.YES));
        return doc;
    }
}

This is the console output:

===========================
Building the index:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #1: content:[a TO h]
Result:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #2: content:[A TO H]
Result:
===========================
Query #3: content:[a TO f]
Result:
name: 1st - content: Humpty Dumpty sat on a wall,
name: 2nd - content: Humpty Dumpty had a great fall.
name: 3rd - content: All the king's horses and all the king'smen
name: 4th - content: Couldn't put Humpty together again.
===========================
Query #4: content:[A TO F]
Result:
===========================

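My conclusion:
If the indexed text (for the "content" field) is stored as lowercase strings, the results of queries #1, #2 and #4 could be correct.
But if that is the case, the result of query #3 would be wrong.
Query #3 should find only the 3rd and 4th entries.
Where is my mistake?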

The standard analyzer uses the lower case filter, so yes, all indexed data will be lowercase:

Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words.

Also, bear in mind:

TermRangeQuery.newStringRange("content", "a", "f", true, true);

means that "a" and "f" are themselves included in the range (the two true values).

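Therefore the "a" in "had a great fall" is a match, and every other line also contains at least one term between "a" and "f" ("a", "dumpty", "all", "and", "again", "couldn't"). That is why all 4 results are found in query 3. Change the third search to something like the following to see the effect: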

TermRangeQuery.newStringRange("content", "a", "b", true, true);
TermRangeQuery.newStringRange("content", "a", "b", false, false);
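
To make the comparison concrete, here is a minimal plain-Java sketch that imitates the range check without using Lucene at all. It assumes that analysis amounts to lowercasing and splitting on anything that is not a letter or an apostrophe, which is only a rough approximation of StandardAnalyzer, and that terms are compared like plain ASCII strings. The class and method names (RangeSketch, tokens, inRange, matchingDocs) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RangeSketch {
    // The four documents from the question, keyed by their "name" field.
    static final Map<String, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put("1st", "Humpty Dumpty sat on a wall,");
        DOCS.put("2nd", "Humpty Dumpty had a great fall.");
        DOCS.put("3rd", "All the king's horses and all the king'smen");
        DOCS.put("4th", "Couldn't put Humpty together again.");
    }

    // Rough stand-in for analysis (an assumption, not Lucene's tokenizer):
    // lowercase, then split on anything that is not a letter or an apostrophe.
    static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z']+");
    }

    // Lexicographic range check with optional inclusive bounds.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        return (includeLower ? lo >= 0 : lo > 0) && (includeUpper ? hi <= 0 : hi < 0);
    }

    // Names of the documents that contain at least one term inside the range.
    static List<String> matchingDocs(String lower, String upper,
                                     boolean includeLower, boolean includeUpper) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> entry : DOCS.entrySet()) {
            for (String term : tokens(entry.getValue())) {
                if (!term.isEmpty() && inRange(term, lower, upper, includeLower, includeUpper)) {
                    names.add(entry.getKey());
                    break; // one matching term is enough for the document to match
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(matchingDocs("a", "f", true, true));   // [1st, 2nd, 3rd, 4th]
        System.out.println(matchingDocs("A", "F", true, true));   // []
        System.out.println(matchingDocs("a", "b", true, true));   // [1st, 2nd, 3rd, 4th]
        System.out.println(matchingDocs("a", "b", false, false)); // [3rd, 4th]
    }
}
```

In this sketch, only the exclusive range drops the first two lines, because the single term "a" is the only thing that puts them inside the range.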

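The next point is not strictly related to your question, but it may be useful. You generally want to use the same analyzer when searching as the one you used when indexing the data (there are exceptions). For example, it is very common for a search to match search terms case-insensitively. By running the search terms through the standard analyzer, you can achieve this.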

There are various ways to do this. Here is one approach; there may be neater ways:

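import org.apache.lucene.queryparser.classic.QueryParser;

...

// parse() throws ParseException, so the calling method must declare or catch it.
QueryParser parser = new QueryParser("content", analyzer);

Query q1 = TermRangeQuery.newStringRange("content", "b", "h", true, true);
Query query1 = parser.parse(q1.toString());
showQueryResult(1, query1);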

Given the above, the results should make sense.
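
As a rough illustration of why normalizing the search terms matters, the following plain-Java sketch compares an indexed (lowercased) term against uppercase bounds and against the same bounds after lowercasing. It does not call Lucene; normalize is a hypothetical stand-in for whatever the analyzer does to search terms, assumed here to be lowercasing only, and inRange is a hypothetical helper.

```java
import java.util.Locale;

class BoundNormalizationSketch {
    // Hypothetical helper: inclusive lexicographic range check on plain strings.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Hypothetical stand-in for the lowercasing an analyzer applies to search terms.
    static String normalize(String bound) {
        return bound.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String term = "dumpty"; // an indexed term, already lowercased at index time

        // Uppercase letters sort before lowercase ones, so "dumpty" lies above "H".
        System.out.println(inRange(term, "B", "H"));                       // false

        // After normalizing the bounds, the same term falls inside the range.
        System.out.println(inRange(term, normalize("B"), normalize("H"))); // true
    }
}
```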

If you want to explore what is actually being indexed, I recommend changing to use this:

import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;

And something like this:

Directory directory = new MMapDirectory(Paths.get("E:/lucene/indexes/range_queries"));

And in any case, RAMDirectory is not generally recommended, except for demos.

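Once the data is on disk, you can point Luke at it. Luke is a very useful tool (with a GUI) for exploring indexed data. It is available as a JAR file (lucene-luke-8.x.x.jar), which can be found in the main Lucene binary release package.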

Edit:

If you stop using RAMDirectory and keep the index on disk, you may also want to use this:

if (!DirectoryReader.indexExists(directory)) {
    // index builder logic here
}

This avoids re-populating the index with duplicate data.

Regarding stop words: by default, the standard analyzer has an empty stop word list. You can supply a list of words to the constructor in an org.apache.lucene.analysis.CharArraySet:

import org.apache.lucene.analysis.CharArraySet;

...

// Arguments: initial capacity 2, ignoreCase = true.
CharArraySet myStopWords = new CharArraySet(2, true);
myStopWords.add("foo");
myStopWords.add("bar");
Analyzer analyzer = new StandardAnalyzer(myStopWords);

Or you can use one of the existing stop word lists. Here are the English stop words:

import static org.apache.lucene.analysis.en.EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;

...

Analyzer analyzer = new StandardAnalyzer(ENGLISH_STOP_WORDS_SET);
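
To see what a stop list does to the terms that a range query is later compared against, here is a small plain-Java sketch. It does not use Lucene: the stop list is a hypothetical four-word set chosen for this illustration (it is not Lucene's English set), and analyze is a hypothetical helper that only approximates the analyzer by lowercasing and splitting on non-letter characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
    // Hypothetical stop list for this illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "on", "the", "and"));

    // Lowercase the text, split it into tokens, and drop any token in the stop list.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Humpty Dumpty sat on a wall,"));
        // prints [humpty, dumpty, sat, wall]
    }
}
```

With "a" filtered out like this, the first line has no remaining term between "a" and "b", so under these assumptions a range such as [a TO b] would no longer match it.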