Increase the length of text returned by Highlighter
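Initially, I thought setMaxDocCharsToAnalyze(int) would increase the output length, but it does not.
Currently, the output generated by my search (String fragment) is less than one line long, so it is meaningless as a preview.
Is there some mechanism to make getBestFragment() return at least one sentence? It does not matter whether it is one sentence, one and a half, or more; it just needs to be long enough to make at least some sense.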
Indexing:
Document document = new Document();
document.add(new TextField(FIELD_CONTENT, content, Field.Store.YES));
document.add(new StringField(FIELD_PATH, path, Field.Store.YES));
indexWriter.addDocument(document);
Searching:
QueryParser queryParser = new QueryParser(FIELD_CONTENT, new StandardAnalyzer());
Query query = queryParser.parse(searchQuery);
QueryScorer queryScorer = new QueryScorer(query, FIELD_CONTENT);
Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
Highlighter highlighter = new Highlighter(queryScorer); // Set the best scorer fragments
highlighter.setMaxDocCharsToAnalyze(100000); //"HAS NO EFFECT"
highlighter.setTextFragmenter(fragmenter);
// STEP B
File indexFile = new File(INDEX_DIRECTORY);
Directory directory = FSDirectory.open(indexFile.toPath());
IndexReader indexReader = DirectoryReader.open(directory);
// Assumption: the original snippet never declares searcher; a plain Lucene IndexSearcher is assumed here
IndexSearcher searcher = new IndexSearcher(indexReader);
// STEP C
System.out.println("query: " + query);
ScoreDoc[] scoreDocs = searcher.search(query, MAX_DOC).scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs)
{
    // Load the stored fields of the matching document
    Document document = searcher.doc(scoreDoc.doc);
    String title = document.get(FIELD_CONTENT);
    TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader,
            scoreDoc.doc, FIELD_CONTENT, document, new StandardAnalyzer());
    String fragment = highlighter.getBestFragment(tokenStream, title); // This String is the output whose length should increase
    System.out.println(fragment + "-------");
}
Sample output:
query: +Content:canada +Content:minister
|Liberal]] [[Prime Minister of Canada|Prime Minister]] [[Pierre Trudeau]] led a [[Minority-------
. Thorson, Minister of National War Services, Ottawa. Printed in Canada Description: British lion-------
politician of the [[New Zealand Labour Party| Labour Party]], and a cabinet minister. He represented-------
|}}}| ! [[Minister of Finance (Canada)|Minister]] {{!}} {{{minister-------
, District of Franklin''. Ottawa: Minister of Supply and Services Canada, 1977. ISBN 0660008351
25]], [[1880]] – [[March 4]], [[1975]]) was a [[Canada|Canadian]] provincial and federal-------
-du-Quebec]] region, in [[Canada]]. It is named after the first French Canadian to become Prime-------
11569347, Cannon_family_(Canada) ::: {{for|the American political family|Cannon family-------
minister of [[Guyana]] and prominent Hindu politician in [[Guyana]]. He also served, at various times-------
11559743, Mohammed_Hussein_Al_Shaali ::: '''Mohammed Hussein Al Shaali''' is the former Minister-------
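The Fragmenter is the piece that controls this behavior. You can pass an int to the SimpleSpanFragmenter constructor to control the size of the fragments it produces (the Javadoc gives the unit as bytes; in practice the size is compared against character offsets in the text). The default size is 100. For example, to double it: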
Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer, 200);
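To get a feel for what the size argument changes, here is a small stand-alone model of fixed-size fragmenting. It is not Lucene's code: FragmentSizeDemo and its split method are hypothetical names made up for this illustration, tokens are simply whitespace-delimited, and the break rule (start a new fragment once a token's end offset reaches the next multiple of the size) is an assumption about roughly how a fixed-size fragmenter behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified model of fixed-size fragmenting (not Lucene code).
class FragmentSizeDemo {

    // Splits text into fragments of roughly fragmentSize characters,
    // breaking only between whitespace-delimited tokens.
    static List<String> split(String text, int fragmentSize) {
        List<String> fragments = new ArrayList<>();
        Matcher tokens = Pattern.compile("\\S+").matcher(text);
        int fragStart = 0;  // character offset where the current fragment begins
        int fragCount = 1;  // number of fragments started so far
        while (tokens.find()) {
            // Assumed rule: a token whose end offset reaches the next
            // multiple of fragmentSize opens a new fragment.
            if (tokens.end() >= fragmentSize * fragCount) {
                if (tokens.start() > fragStart) {
                    fragments.add(text.substring(fragStart, tokens.start()).trim());
                }
                fragStart = tokens.start();
                fragCount++;
            }
        }
        // Whatever is left after the last break is the final fragment.
        fragments.add(text.substring(fragStart).trim());
        return fragments;
    }

    public static void main(String[] args) {
        String text = "aaaa bbbb cccc dddd eeee ffff";
        System.out.println(split(text, 10));
        System.out.println(split(text, 20));
    }
}
```

With the same text, doubling the size roughly doubles how much text each fragment carries, which is the effect you are after in the highlighter preview.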
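As for splitting on sentence boundaries, there is no fragmenter for that out of the box. Someone has posted their implementation of one here. It is a very naive implementation, but you may find it helpful if you want to go down that particular rabbit hole.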
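If you only need a readable preview rather than a full Fragmenter, one naive option is to widen a hit to the enclosing sentence with the JDK's java.text.BreakIterator. The sketch below is a plain-Java illustration, not a Lucene Fragmenter and not the implementation linked above; SentencePreview and expandToSentences are hypothetical names, and it assumes you already know the character offsets of the match in the stored text.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: widens a character range to whole sentences.
class SentencePreview {

    // Returns the text from the sentence boundary at or before start
    // to the sentence boundary at or after end.
    // Requires 0 <= start <= end <= text.length().
    static String expandToSentences(String text, int start, int end) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        // Keep an offset that already sits on a boundary; otherwise move outward.
        int from = sentences.isBoundary(start) ? start : sentences.preceding(start);
        int to = sentences.isBoundary(end) ? end : sentences.following(end);
        return text.substring(from, to).trim();
    }

    public static void main(String[] args) {
        String text = "He was born in Ottawa. He later served as a cabinet minister. "
                + "He retired in 1975.";
        int hit = text.indexOf("minister");
        System.out.println(expandToSentences(text, hit, hit + "minister".length()));
    }
}
```

Keep in mind that BreakIterator expects ordinary prose. On raw wiki markup like the sample output above, its sentence boundaries will often be poor, so you would probably want to strip the markup first.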