Lucene: how can I get the position of a found query?

I have a QueryParser, and I want to find the string "War Force" in my text:

TextWord[0]: 2003
TextWord[1]: 09
TextWord[2]: 22T19
TextWord[3]: 01
TextWord[4]: 14Z
TextWord[5]: Book0
TextWord[6]: WEAPONRY
TextWord[7]: NATO2
TextWord[8]: Bar
TextWord[9]: WEAPONRY
TextWord[10]: State
TextWord[11]: WEAPONRY
TextWord[12]: 123
TextWord[13]: War
TextWord[14]: WORD1
TextWord[15]: Force
TextWord[16]: And
TextWord[17]: Book4
TextWord[18]: Book
TextWord[19]: WEAPONRY
TextWord[20]: Book6
TextWord[21]: Terrorist.
TextWord[22]: And
TextWord[23]: WEAPONRY
TextWord[24]: 18
TextWord[25]: 31
TextWord[26]: state
TextWord[27]: AND

I can see that it is found when I use a phrase slop of 1 (I mean: "war" word1 "force").

I can find the position of "war" or of "force":

        DirectoryReader reader = DirectoryReader.open(this.memoryIndex);
        IndexSearcher searcher = new IndexSearcher(reader);
        
        QueryParser queryParser = new QueryParser("tags", new StandardAnalyzer());
        Query query = queryParser.parse("\"War Force\"~1");
        TopDocs results = searcher.search(query, 1);

        for (ScoreDoc scoreDoc : results.scoreDocs) {

            Fields termVs = reader.getTermVectors(scoreDoc.doc);
            Terms f = termVs.terms("tags");

            String searchTerm = "War".toLowerCase();
            BytesRef ref = new BytesRef(searchTerm);

            TermsEnum te = f.iterator();
            PostingsEnum docsAndPosEnum = null;
            if (te.seekExact(ref)) {
                
                docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
                int nextDoc = docsAndPosEnum.nextDoc();
                assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
                final int fr = docsAndPosEnum.freq();
                final int p = docsAndPosEnum.nextPosition();
                final int o = docsAndPosEnum.startOffset();

                System.out.println("Word: " + ref.utf8ToString());
                System.out.println("Position: " + p + ", startOffset: " + o + " length: " + ref.length + " Freq: " + fr);
                if (fr > 1) {
                    for (int iter = 1; iter <= fr - 1; iter++) {
                        System.out.println("Position: " + docsAndPosEnum.nextPosition());
                    }
                }
            }

            System.out.println("Finish");
        }

But I cannot find the position of the matched query "War Force" (or of similar queries). How can I get the positions of the query matches?

There is probably more than one way to do this, but I suggest using the FastVectorHighlighter, because it gives you access to position and offset data.

Index Requirements

To use this approach, you need to make sure the data is indexed using a field that stores term vector data:

final String fieldName = "body";
// a shorter version of the input data in the question, for testing:
final String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);

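// assumption: the index lives in "./index", the same path the search code below reads from
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("./index")), config)) {
    Document doc = new Document();
    doc.add(new Field(fieldName, content, fieldType));
    writer.addDocument(doc);
}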

(If you are not already capturing term vectors, this can significantly increase the size of your indexed data.)

Library Requirements

The FastVectorHighlighter is part of the lucene-highlighter library:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.9.0</version>
</dependency>

Search Example

Assume the following query:

final String searchTerm = "\"War Force\"~1";

We expect this to find War WORD1 Force in our test data.
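
To make that expectation concrete, here is a minimal, Lucene-free sketch of why a slop of 1 is enough. It assumes a simplified reading of slop for a two-term, in-order phrase: the number of other tokens allowed between the two terms. Lucene's real sloppy phrase matching is more general (for example, it can also match reordered terms at a higher slop), and the names SlopSketch, tokenize and matchesInOrder are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SlopSketch {

    // Hypothetical helper: true if `first` is followed by `second` with at most
    // `slop` other tokens between them. In-order matches only; this is a
    // simplification, not a reimplementation of Lucene's matching.
    static boolean matchesInOrder(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens sitting between the two terms
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Lower-cases and splits on whitespace. For this simple sample text that
    // yields the same terms we would expect from StandardAnalyzer (an assumption).
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY");
        System.out.println(matchesInOrder(tokens, "war", "force", 0)); // false
        System.out.println(matchesInOrder(tokens, "war", "force", 1)); // true
    }
}
```

With a slop of 0 the single intervening token (WORD1) blocks the match; with a slop of 1 it is tolerated.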

The first part of the process performs standard query execution, using the classic query parser:

Directory dir = FSDirectory.open(Paths.get(indexPath));
try ( DirectoryReader dirReader = DirectoryReader.open(dir)) {
    IndexSearcher indexSearcher = new IndexSearcher(dirReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(fieldName, analyzer);
    Query query = parser.parse(searchTerm);
    TopDocs topDocs = indexSearcher.search(query, 100);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        handleHit(hit, query, dirReader, indexSearcher);
    }
}

The handleHit() method (shown below) is where we use the FastVectorHighlighter.

If you only want to perform highlighting (and do not need position/offset data), you can use:

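FastVectorHighlighter fvh = new FastVectorHighlighter();
// build the FieldQuery from the parsed query before asking for a fragment
FieldQuery fieldQuery = fvh.getFieldQuery(query, dirReader);
String fragment = fvh.getBestFragment(fieldQuery, dirReader, docId, fieldName, fragCharSize);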

But to access the extra data we need, you can do the following:

// same settings as in the full code at the end of this answer
boolean phraseHighlight = true;
boolean fieldMatch = true;
FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
        fragListBuilder, fragmentsBuilder);

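This builds a FastVectorHighlighter together with a FieldPhraseList. The FieldPhraseList is built from the document's term vectors and the query, and it holds the phrases matched in that document.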

The getBestFragment call now becomes a call to getBestFragments:

// use whatever you want for these settings:
int fragCharSize = 100;
int maxNumFragments = 100;
String[] preTags = new String[]{"-->"};
String[] postTags = new String[]{"<--"};

Encoder encoder = new DefaultEncoder();
// the fragments string array contains the highlighted results:
String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
        fieldName, fragCharSize, maxNumFragments, fragListBuilder,
        fragmentsBuilder, preTags, postTags, encoder);

Finally, we can use fieldPhraseList to access the data we need:

// the following gives you access to positions and offsets:
fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
    int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
    int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
    weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
        String term = termInfo.getText();                // "war"  "force"
        int termPosition = termInfo.getPosition() + 1;    // 4      6
        int termStartOffset = termInfo.getStartOffset(); // 19     29
        int termEndOffset = termInfo.getEndOffset();     // 22     34
    });
});

phraseStartOffset and phraseEndOffset are character counts that tell us where the whole phrase is located in the source document:

State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY

So, in our case, this is the string from offset 19 to offset 34 (offset 0 is the position to the left of the first "S").
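
As a quick sanity check on those numbers, the offsets can be applied directly to the source text with String.substring, treating the start offset as inclusive and the end offset as exclusive. This is plain Java with no Lucene involved; the class name OffsetCheck and the slice helper are only illustrative.

```java
final class OffsetCheck {

    // Returns the text between a start offset (inclusive) and an end offset (exclusive).
    static String slice(String content, int startOffset, int endOffset) {
        return content.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println(slice(content, 19, 34)); // War WORD1 Force
        System.out.println(slice(content, 19, 22)); // War
        System.out.println(slice(content, 29, 34)); // Force
    }
}
```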

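Then, for each specific term in the search query ("war" and "force"), we can access its offsets and also its word position (termPosition). Position 0 is the first word, so I add 1 to this index, which gives "war" at position 4 and "force" at position 6 in the original document: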

1     2        3   4   5     6     7   8     9    10
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
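
The numbers above can be reproduced without Lucene by walking the text one whitespace-separated word at a time. This is only a sketch: it assumes plain whitespace tokenization, which for this sample text should line up with the positions and offsets shown above, and the names PositionSketch and Token are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PositionSketch {

    // One word with its 1-based word position and its character offsets.
    static final class Token {
        final String text;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Splits on whitespace and records where each word starts and ends.
    static List<Token> tokenize(String content) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(content);
        int position = 0;
        while (m.find()) {
            position++; // 1-based, like the adjusted termPosition
            tokens.add(new Token(m.group(), position, m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        for (Token t : tokenize(content)) {
            if (t.text.equalsIgnoreCase("war") || t.text.equalsIgnoreCase("force")) {
                System.out.println(t.text + " position=" + t.position
                        + " offsets=" + t.startOffset + "-" + t.endOffset);
            }
        }
        // prints:
        // War position=4 offsets=19-22
        // Force position=6 offsets=29-34
    }
}
```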

Full code, for reference:

import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexHighlighterDemo {

    final String indexPath = "./index";
    final String fieldName = "body";
    final String searchTerm = "\"War Force\"~1";

    public void doDemo() throws IOException, ParseException {

        Directory dir = FSDirectory.open(Paths.get(indexPath));
        try ( DirectoryReader dirReader = DirectoryReader.open(dir)) {
            IndexSearcher indexSearcher = new IndexSearcher(dirReader);
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(fieldName, analyzer);
            Query query = parser.parse(searchTerm);

            System.out.println();
            System.out.println("Search term: [" + searchTerm + "]");
            System.out.println("Parsed query: [" + query.toString() + "]");

            TopDocs topDocs = indexSearcher.search(query, 100);

            ScoreDoc[] hits = topDocs.scoreDocs;
            for (ScoreDoc hit : hits) {
                handleHit(hit, query, dirReader, indexSearcher);
            }
        }
    }

    private void handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader,
            IndexSearcher indexSearcher) throws IOException {

        boolean phraseHighlight = Boolean.TRUE;
        boolean fieldMatch = Boolean.TRUE;
        FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

        FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
        FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
        FragListBuilder fragListBuilder = new SimpleFragListBuilder();
        FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
        FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
                fragListBuilder, fragmentsBuilder);

        // use whatever you want for these settings:
        int fragCharSize = 100;
        int maxNumFragments = 100;
        String[] preTags = new String[]{"-->"};
        String[] postTags = new String[]{"<--"};
        
        Encoder encoder = new DefaultEncoder();
        // the fragments string array contains the highlighted results:
        String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
                fieldName, fragCharSize, maxNumFragments, fragListBuilder,
                fragmentsBuilder, preTags, postTags, encoder);

        // the following gives you access to positions and offsets:
        fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
            int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
            int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
            weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
                String term = termInfo.getText();                // "war"  "force"
                int termPosition = termInfo.getPosition() + 1;    // 4      6
                int termStartOffset = termInfo.getStartOffset(); // 19     29
                int termEndOffset = termInfo.getEndOffset();     // 22     34
            });
        });

        // get the scores, also, if needed:
        BigDecimal score = new BigDecimal(String.valueOf(hit.score))
                .setScale(3, RoundingMode.HALF_EVEN);
        Document hitDoc = indexSearcher.doc(hit.doc);
    }

}