Lucene: how can I get the position of a found query?
I have a QueryParser, and I want to find the string "War Force" in my text:
TextWord[0]: 2003
TextWord[1]: 09
TextWord[2]: 22T19
TextWord[3]: 01
TextWord[4]: 14Z
TextWord[5]: Book0
TextWord[6]: WEAPONRY
TextWord[7]: NATO2
TextWord[8]: Bar
TextWord[9]: WEAPONRY
TextWord[10]: State
TextWord[11]: WEAPONRY
TextWord[12]: 123
TextWord[13]: War
TextWord[14]: WORD1
TextWord[15]: Force
TextWord[16]: And
TextWord[17]: Book4
TextWord[18]: Book
TextWord[19]: WEAPONRY
TextWord[20]: Book6
TextWord[21]: Terrorist.
TextWord[22]: And
TextWord[23]: WEAPONRY
TextWord[24]: 18
TextWord[25]: 31
TextWord[26]: state
TextWord[27]: AND
I can see that it is found when I use a phrase slop of 1 (that is, it matches "war" WORD1 "force"). I can also find the position of "war" or "force" individually:
DirectoryReader reader = DirectoryReader.open(this.memoryIndex);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser queryParser = new QueryParser("tags", new StandardAnalyzer());
Query query = queryParser.parse("\"War Force\"~1");
TopDocs results = searcher.search(query, 1);
for (ScoreDoc scoreDoc : results.scoreDocs) {
    Fields termVs = reader.getTermVectors(scoreDoc.doc);
    Terms f = termVs.terms("tags");
    String searchTerm = "War".toLowerCase();
    BytesRef ref = new BytesRef(searchTerm);
    TermsEnum te = f.iterator();
    PostingsEnum docsAndPosEnum = null;
    if (te.seekExact(ref)) {
        docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
        int nextDoc = docsAndPosEnum.nextDoc();
        assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
        final int fr = docsAndPosEnum.freq();
        final int p = docsAndPosEnum.nextPosition();
        final int o = docsAndPosEnum.startOffset();
        System.out.println("Word: " + ref.utf8ToString());
        System.out.println("Position: " + p + ", startOffset: " + o + " length: " + ref.length + " Freq: " + fr);
        if (fr > 1) {
            for (int iter = 1; iter <= fr - 1; iter++) {
                System.out.println("Position: " + docsAndPosEnum.nextPosition());
            }
        }
    }
    System.out.println("Finish");
}
But I cannot find the position of the matched query "War Force" (or similar queries) as a whole. How can I get the position of a matched query result?
There may be more than one way to do this, but I would suggest using a FastVectorHighlighter, since it gives you access to position and offset data.
Indexing requirements
To use this approach, you need to ensure that, when the index is created, the indexed data uses fields which store term vector data:
final String fieldName = "body";
// a shorter version of the input data in the question, for testing:
final String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);
doc.add(new Field(fieldName, content, fieldType));
writer.addDocument(doc);
(This can significantly increase the size of your indexed data, if you are not already capturing term vectors.)
Library requirements
The FastVectorHighlighter is part of the lucene-highlighter library:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.9.0</version>
</dependency>
Search example
Assume the following query:
final String searchTerm = "\"War Force\"~1";
We expect it to find War WORD1 Force in our test data.
The first part of the process performs a standard query execution, using the classic query parser:
Directory dir = FSDirectory.open(Paths.get(indexPath));
try (DirectoryReader dirReader = DirectoryReader.open(dir)) {
    IndexSearcher indexSearcher = new IndexSearcher(dirReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(fieldName, analyzer);
    Query query = parser.parse(searchTerm);
    TopDocs topDocs = indexSearcher.search(query, 100);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        handleHit(hit, query, dirReader, indexSearcher);
    }
The handleHit() method (shown below) is where we use the FastVectorHighlighter.
If you only want to perform highlighting (and do not need the position/offset data), you can use:
FastVectorHighlighter fvh = new FastVectorHighlighter();
fvh.getBestFragment(fieldQuery, dirReader, docId, fieldName, fragCharSize)
But to access the extra data we need, you can do the following:
FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
fragListBuilder, fragmentsBuilder);
This builds a FastVectorHighlighter, together with a FieldPhraseList that will be populated by the highlighter. The getBestFragment call now becomes:
// use whatever you want for these settings:
int fragCharSize = 100;
int maxNumFragments = 100;
String[] preTags = new String[]{"-->"};
String[] postTags = new String[]{"<--"};
Encoder encoder = new DefaultEncoder();
// the fragments string array contains the highlighted results:
String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
fieldName, fragCharSize, maxNumFragments, fragListBuilder,
fragmentsBuilder, preTags, postTags, encoder);
Finally, we can use the fieldPhraseList to access the data we need:
// the following gives you access to positions and offsets:
fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
int phraseEndOffset = weightedPhraseInfo.getEndOffset(); // 34
weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
String term = termInfo.getText(); // "war" "force"
int termPosition = termInfo.getPosition() + 1; // 4 6
int termStartOffset = termInfo.getStartOffset(); // 19 29
int termEndOffset = termInfo.getEndOffset(); // 22 34
});
});
phraseStartOffset and phraseEndOffset are character counts, telling us where the overall phrase is located in the source document:
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
So, in our case, that is the string from offset 19 to offset 34 (where offset 0 is the position to the left of the first "S").
Then, for each specific term in the search query ("war" and "force"), we can access their offsets, as well as their word positions (termPosition). Position 0 is the first word, so I add 1 to this index, giving "war" at word 4 and "force" at word 6 in the original document:
1 2 3 4 5 6 7 8 9 10
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
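These numbers can be sanity-checked with plain string operations against the stored field value, independently of Lucene. A minimal standalone sketch (the class name OffsetCheck is just for illustration):

```java
public class OffsetCheck {
    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";

        // The phrase offsets reported by the highlighter are character
        // offsets into the stored field value, so substring(19, 34)
        // recovers the whole matched phrase:
        System.out.println(content.substring(19, 34)); // War WORD1 Force

        // The per-term offsets work the same way:
        System.out.println(content.substring(19, 22)); // War
        System.out.println(content.substring(29, 34)); // Force

        // Term positions are 0-based word indices; adding 1 gives the
        // 1-based word numbering shown above:
        String[] words = content.split(" ");
        System.out.println(words[3]); // War   (position 3 + 1 = word 4)
        System.out.println(words[5]); // Force (position 5 + 1 = word 6)
    }
}
```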
The full code, for reference:
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class VectorIndexHighlighterDemo {

    final String indexPath = "./index";
    final String fieldName = "body";
    final String searchTerm = "\"War Force\"~1";

    public void doDemo() throws IOException, ParseException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        try (DirectoryReader dirReader = DirectoryReader.open(dir)) {
            IndexSearcher indexSearcher = new IndexSearcher(dirReader);
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(fieldName, analyzer);
            Query query = parser.parse(searchTerm);
            System.out.println();
            System.out.println("Search term: [" + searchTerm + "]");
            System.out.println("Parsed query: [" + query.toString() + "]");
            TopDocs topDocs = indexSearcher.search(query, 100);
            ScoreDoc[] hits = topDocs.scoreDocs;
            for (ScoreDoc hit : hits) {
                handleHit(hit, query, dirReader, indexSearcher);
            }
        }
    }

    private void handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader,
            IndexSearcher indexSearcher) throws IOException {
        boolean phraseHighlight = Boolean.TRUE;
        boolean fieldMatch = Boolean.TRUE;
        FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);
        FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
        FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
        FragListBuilder fragListBuilder = new SimpleFragListBuilder();
        FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
        FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
                fragListBuilder, fragmentsBuilder);

        // use whatever you want for these settings:
        int fragCharSize = 100;
        int maxNumFragments = 100;
        String[] preTags = new String[]{"-->"};
        String[] postTags = new String[]{"<--"};
        Encoder encoder = new DefaultEncoder();

        // the fragments string array contains the highlighted results:
        String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
                fieldName, fragCharSize, maxNumFragments, fragListBuilder,
                fragmentsBuilder, preTags, postTags, encoder);

        // the following gives you access to positions and offsets:
        fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
            int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
            int phraseEndOffset = weightedPhraseInfo.getEndOffset(); // 34
            weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
                String term = termInfo.getText(); // "war" "force"
                int termPosition = termInfo.getPosition() + 1; // 4 6
                int termStartOffset = termInfo.getStartOffset(); // 19 29
                int termEndOffset = termInfo.getEndOffset(); // 22 34
            });
        });

        // get the scores, also, if needed:
        BigDecimal score = new BigDecimal(String.valueOf(hit.score))
                .setScale(3, RoundingMode.HALF_EVEN);
        Document hitDoc = indexSearcher.doc(hit.doc);
    }
}