QueryParser with CustomAnalyzer messes up the order in which PatternReplaceCharFilter is applied
I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene 6.0.0 to parse queries with a CustomAnalyzer, as shown below:
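    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public static void testFilmAnalyzer() throws IOException, ParseException {
        // Char filter drops everything from "movie", "film" or "picture" onwards.
        CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
                .addCharFilter("patternreplace",
                        "pattern", "(movie|film|picture).*",
                        "replacement", "")
                .withTokenizer("standard")
                .build();
        QueryParser qp = new QueryParser("name", nameAnalyzer);
        qp.setDefaultOperator(QueryParser.Operator.AND);
        String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String str : strs) {
            // Run the analyzer directly on the whole string.
            System.out.println("Analyzing \"" + str + "\":");
            showTokens(str, nameAnalyzer);
            // Run the same string through the query parser.
            Query q = qp.parse(str);
            System.out.println("Parsed query of \"" + str + "\":");
            System.out.println(q + "\n");
        }
    }

    private static void showTokens(String text, Analyzer analyzer) throws IOException {
        // try-with-resources closes the stream even if iteration fails.
        try (TokenStream stream = analyzer.tokenStream("name", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term.toString() + "]");
            }
            stream.end();
        }
        System.out.println();
    }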
When I call testFilmAnalyzer, I get the following output:
Analyzing "avatar film fiction":
[avatar]
Parsed query of "avatar film fiction":
+name:avatar +name:fiction
Analyzing "avatar-film fiction":
[avatar]
Parsed query of "avatar-film fiction":
+name:avatar +name:fiction
Analyzing "avatar-film-fiction":
[avatar]
Parsed query of "avatar-film-fiction":
name:avatar
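The analyzer appears to apply PatternReplaceCharFilter in the correct, expected order (that is, before tokenization), whereas QueryParser seems to apply it afterwards. Does anyone have an explanation for this? Isn't this a bug?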
No, this is not a bug. CharFilters are always applied before tokenization, at query time as well as at index time.
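However, whitespace is significant in QueryParser syntax, entirely independently of analysis. Whitespace separates the clauses of a query, and each clause is analyzed separately. This is easier to see if you don't rely on the default field, in which case the query avatar-film fiction has to be written as name:avatar-film name:fiction. Each of the two clauses, "avatar-film" and "fiction", is analyzed on its own, so the char filter never sees "fiction" as text following "film". That produces the result you are seeing.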
Try using phrase queries instead:
String[] strs = {"\"avatar film fiction\"", "\"avatar-film fiction\"", "\"avatar-film-fiction\""};
You should then see the results you expect.
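To make the difference concrete, here is a minimal plain-Java sketch that imitates the two orders of operation. It does not call Lucene: the class ClauseAnalysisSketch, its methods, the clause-splitting regex and the letters-and-digits tokenizer are all hypothetical stand-ins chosen for illustration. Only the char filter pattern is taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of the behaviour discussed above; it does not call Lucene. */
final class ClauseAnalysisSketch {

    // Same pattern as the patternreplace char filter in the question.
    private static final Pattern CHAR_FILTER = Pattern.compile("(movie|film|picture).*");

    // Hypothetical clause splitter: a double-quoted span is one clause,
    // otherwise clauses are separated by whitespace.
    private static final Pattern CLAUSE = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Crude stand-in for the standard tokenizer: runs of letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Char filter first, then tokenization, over one piece of text. */
    static List<String> analyze(String text) {
        String filtered = CHAR_FILTER.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(filtered);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Split the query into clauses first, then analyze each clause on its own. */
    static List<String> parseThenAnalyze(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String clause = m.group(1) != null ? m.group(1) : m.group(2);
            tokens.addAll(analyze(clause));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] queries = {
            "avatar film fiction", "avatar-film fiction", "\"avatar film fiction\""
        };
        for (String q : queries) {
            System.out.println(q + " -> analyzer alone: " + analyze(q)
                    + ", clause by clause: " + parseThenAnalyze(q));
        }
    }
}
```

In this sketch the unquoted queries keep "fiction" when processed clause by clause, because the filter only ever sees one clause at a time, while the quoted form is handed over as a single clause and loses everything from "film" onwards, just as it does when the analyzer is run directly.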