After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

Question

I am using Lucene to build a search engine for a large collection of HTML documents.

I understand that I can use PostingsHighlighter and friends to show snippets with the matched words in bold, similar to Google search results, and also similar to this random Lucene-based example.

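However, unlike those examples, I need a solution that keeps the matched words highlighted even after the user opens the matching document, similar to Google Books.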

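Some of the words are hyphenated across elements, in the form <div> ... an inter-</div><div...>national audience ...</div>. I imagine I need to convert the documents to plain text first, then write some code to merge the hyphenated words, before sending them to Lucene.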
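A minimal sketch of that preprocessing step, in plain Java with no Lucene dependency. The class and method names (HtmlToText, extract, Extracted) are hypothetical. It assumes every tag is a word boundary, merges a word only when the hyphen sits directly before the markup, and does not decode entities such as &amp;amp;. A real pipeline would more likely use a proper HTML parser. Alongside the text it records, for each plain-text character, the index it came from in the HTML, so that offsets can be mapped back later.

```java
import java.util.Arrays;

/** Sketch: strips tags, merges words hyphenated across a tag boundary, records offsets. */
class HtmlToText {
    /** Plain text plus, for each plain-text character, its index in the source HTML. */
    static final class Extracted {
        final String text;
        final int[] htmlOffsets;

        Extracted(String text, int[] htmlOffsets) {
            this.text = text;
            this.htmlOffsets = htmlOffsets;
        }
    }

    static Extracted extract(String html) {
        StringBuilder text = new StringBuilder();
        int[] map = new int[html.length()]; // plain text is never longer than the HTML
        boolean inTag = false;
        boolean tagSkipped = false; // true if markup was skipped since the last text character
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                inTag = (c != '>');
                continue;
            }
            if (c == '<') {
                inTag = true;
                tagSkipped = true;
                continue;
            }
            if (tagSkipped) {
                tagSkipped = false;
                int n = text.length();
                boolean hyphenated = n >= 2 && text.charAt(n - 1) == '-'
                        && Character.isLetter(text.charAt(n - 2)) && Character.isLetter(c);
                if (hyphenated) {
                    // Drop the hyphen so that "inter-" + "national" becomes one word.
                    text.setLength(n - 1);
                } else if (n > 0 && !Character.isWhitespace(text.charAt(n - 1))
                        && !Character.isWhitespace(c)) {
                    // Synthetic space keeps words in adjacent elements apart.
                    map[n] = i;
                    text.append(' ');
                }
            }
            map[text.length()] = i;
            text.append(c);
        }
        return new Extracted(text.toString(), Arrays.copyOf(map, text.length()));
    }
}
```

For a match covering plain-text characters [s, e), the corresponding HTML range would run from htmlOffsets[s] to htmlOffsets[e - 1] + 1. Note that this sketch would also merge genuine compounds such as "well-" + "known", so a dictionary check may be needed.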

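Once the user opens the resulting document, I was hoping I could use Lucene to get the character offsets of every matched word in the document.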

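I will then have to cross-reference the offsets in the plain text back to the original HTML, and write code that wraps the words in <b> based on those offsets: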

<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
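Producing markup of that shape from a pair of HTML offsets could be sketched as follows. This is an illustration only: HtmlBolder and bold are hypothetical names, and the sketch assumes well-formed HTML and a range whose two ends fall in text rather than inside a tag. Whenever the range crosses markup, it closes the <b> before the tag and reopens it afterwards, so no existing element ends up straddled.

```java
/** Sketch: wraps a range of HTML text in <b> tags without enclosing any markup. */
class HtmlBolder {
    /**
     * Bolds the text in html[start, end). Assumes both ends of the range fall in
     * text, not inside a tag, and that the HTML is well formed.
     */
    static String bold(String html, int start, int end) {
        StringBuilder out = new StringBuilder(html.substring(0, start));
        boolean open = false; // true while an emitted <b> is awaiting its </b>
        int i = start;
        while (i < end) {
            if (html.charAt(i) == '<') {
                if (open) {
                    out.append("</b>");
                    open = false;
                }
                int close = html.indexOf('>', i);
                int next = (close < 0) ? html.length() : close + 1;
                out.append(html, i, next); // copy the tag through unchanged
                i = next;
            } else {
                if (!open) {
                    out.append("<b>");
                    open = true;
                }
                out.append(html.charAt(i));
                i++;
            }
        }
        if (open) {
            out.append("</b>");
        }
        out.append(html.substring(i)); // remainder of the document, untouched
        return out.toString();
    }
}
```

With several matches in one document, applying them from the last match to the first keeps the earlier offsets valid, since each call only changes text at or after its own start.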

How can I get what I need from Lucene? Surely I don't have to write my own 'final inch' of search for this?

Answer 1

OK, I figured out something you can start with. :)

To index:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source
// only add each one once, or else your docs will be duplicated as you continue to use the system
writer.close();

Specify that offsets are to be stored, for highlighting:

private static final FieldType typeOffsets;
static {
    typeOffsets = new FieldType(TextField.TYPE_STORED);
    typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}

The addDoc method:

void addDoc(IndexWriter writer, String title, String body) throws IOException {
  Document doc = new Document();
  // both fields are indexed with offsets so they can be highlighted later
  doc.add(new Field("title", title, typeOffsets));
  doc.add(new Field("body", body, typeOffsets));
  // you can also add and store a TextField that does not have offsets,
  // like a file ID that you wouldn't search on but need in order to reference the original doc.
  writer.addDocument(doc);
}

Perform the first search:

String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(q);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);

Use highlighter.highlightFields(fields, query, searcher, topDocs) to get the highlighted snippets. You can iterate over the results.

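When you want to highlight the final document (i.e. after the search has completed and the user has selected a result), use this solution (with minor edits). It works by using NullFragmenter to turn the whole text into a single fragment.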

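public static String highlight(String pText, String pQuery) throws Exception
{
    // Written against Lucene 3.0; the indexing code above uses the newer constructors
    // without Version arguments: new StandardAnalyzer(), new QueryParser("", analyzer).
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
    Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
    // NullFragmenter treats the whole text as a single fragment
    highlighter.setTextFragmenter(new NullFragmenter());

    String text = highlighter.getBestFragment(analyzer, "", pText);

    if (text != null)
    {
        return text;
    }
    // no match: return the text unchanged
    return pText;
}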

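Edit: You can actually use PostingsHighlighter instead of Highlighter in this last step, but you have to override getBreakIterator and supply your own BreakIterator that treats the whole document as a single sentence.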
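A BreakIterator of that kind could be sketched as below, using only java.text. The class name WholeTextBreakIterator is hypothetical. The idea would be to return an instance from an overridden getBreakIterator in a PostingsHighlighter subclass, which is not shown here because it depends on the Lucene version in use. The iterator reports exactly two boundaries, the start and the end of the text, so the entire document counts as one passage.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

/** Sketch: a BreakIterator whose only boundaries are the start and end of the text. */
class WholeTextBreakIterator extends BreakIterator {
    private CharacterIterator text = new StringCharacterIterator("");
    private int start;
    private int end;
    private int current;

    @Override public void setText(CharacterIterator newText) {
        text = newText;
        start = newText.getBeginIndex();
        end = newText.getEndIndex();
        current = start;
    }

    @Override public CharacterIterator getText() { return text; }

    @Override public int current() { return current; }

    @Override public int first() { return current = start; }

    @Override public int last() { return current = end; }

    @Override public int next() {
        return (current == end) ? DONE : last();
    }

    @Override public int previous() {
        return (current == start) ? DONE : first();
    }

    // Simplified: steps n times and reports the resulting position.
    @Override public int next(int n) {
        for (int i = 0; i < Math.abs(n); i++) {
            if (n > 0) next(); else previous();
        }
        return current;
    }

    @Override public int following(int offset) {
        checkRange(offset);
        if (offset == end) {
            current = end;
            return DONE; // nothing lies after the end of the text
        }
        return last();
    }

    @Override public int preceding(int offset) {
        checkRange(offset);
        if (offset == start) {
            current = start;
            return DONE; // nothing lies before the start of the text
        }
        return first();
    }

    private void checkRange(int offset) {
        if (offset < start || offset > end) {
            throw new IllegalArgumentException("offset out of bounds: " + offset);
        }
    }
}
```

Because the only break after position 0 is the end of the text, sentence punctuation inside the document is ignored entirely, which is the behavior the edit above asks for.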

Edit: You can override getFormatter to capture the offsets directly, instead of trying to parse the <b> tags that PostingsHighlighter normally outputs.
