Lucene ignores/overrides fuzzy edit distance in QueryParser

Question

Suppose the following QueryParser is given a query string that contains a fuzzy search term:

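import org.apache.lucene.analysis.de.GermanAnalyzer
import org.apache.lucene.queryparser.classic.QueryParser

fun fuzzyquery() {
    // "term" is the default field; the query string targets "field" explicitly
    val query = QueryParser("term", GermanAnalyzer()).parse("field:search~4")
    println(query)
}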

The resulting query actually has the following representation:

field:search~2

So ~4 is rewritten to ~2. I traced the code to the following implementations:

QueryParserBase

protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    String text = term.text();
    int numEdits = FuzzyQuery.floatToEdits(minimumSimilarity, text.codePointCount(0, text.length()));
    return new FuzzyQuery(term, numEdits, prefixLength);
}

FuzzyQuery

public static int floatToEdits(float minimumSimilarity, int termLen) {
    if (minimumSimilarity >= 1.0F) {
        return (int)Math.min(minimumSimilarity, 2.0F);
    } else {
        return minimumSimilarity == 0.0F ? 0 : Math.min((int)((1.0D - (double)minimumSimilarity) * (double)termLen), 2);
    }
}

Evidently, any value greater than 2 is reset to 2. Why is that, and how can I correctly pass my desired fuzzy edit distance to the query parser?

Answer 1

This may cross the line into "not an answer", but it is too long for a comment (or a few comments):

Why is this?

This appears to be a design decision. It is mentioned in the documentation:

"The value is between 0 and 2"

There is an old article that gives an explanation:

"Larger differences are far more expensive to compute efficiently and are not processed by Lucene."

I don't know how official that is, however.

More officially, the JavaDoc for the FuzzyQuery class states:

"At most, this query will match terms up to 2 edits. Higher distances (especially with transpositions enabled), are generally not useful and will match a significant amount of the term dictionary."
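To make the cap concrete, here is a minimal standalone sketch that copies the floatToEdits logic quoted in the question, so it runs without Lucene on the classpath. The class name FuzzyClampDemo is hypothetical and used only for illustration; the arithmetic is taken from the quoted code.

```java
public class FuzzyClampDemo {
    // Same logic as the floatToEdits method quoted in the question,
    // reproduced here only so the example is self-contained.
    public static int floatToEdits(float minimumSimilarity, int termLen) {
        if (minimumSimilarity >= 1.0f) {
            // Values >= 1 are treated as an edit count, capped at 2.
            return (int) Math.min(minimumSimilarity, 2.0f);
        } else if (minimumSimilarity == 0.0f) {
            return 0;
        } else {
            // Values in (0, 1) are treated as a similarity, scaled by
            // the term length and then capped at 2 as well.
            return Math.min((int) ((1.0d - (double) minimumSimilarity) * (double) termLen), 2);
        }
    }

    public static void main(String[] args) {
        int len = "search".length(); // 6 code points
        System.out.println("~4   -> " + floatToEdits(4.0f, len));
        System.out.println("~1   -> " + floatToEdits(1.0f, len));
        System.out.println("~0.5 -> " + floatToEdits(0.5f, len));
    }
}
```

Both branches end in a min(..., 2), so neither an integer edit count such as ~4 nor a fractional similarity such as ~0.5 can yield more than 2 edits.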

How can I correctly get my desired fuzzy edit distance in the query parser?

You cannot, unless you customize the source code.

I think the best (or least bad?) alternative is probably the one suggested in the FuzzyQuery Javadoc mentioned above:

"If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead."

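In that case, the price to pay may be a larger index, and even then, n-grams are not really equivalent to edit distance. I don't know whether that would meet your needs.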
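As a rough illustration of why n-grams and edit distance are not interchangeable, the sketch below compares character trigram overlap with plain Levenshtein distance in standard Java. It is not how Lucene's SpellChecker is implemented; the class NGramSketch, its method names, and the choice of Jaccard overlap over trigrams are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;

public class NGramSketch {
    // Character n-grams of a term, e.g. trigrams of "search": sea, ear, arc, rch.
    public static Set<String> ngrams(String term, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard overlap of the two n-gram sets, in [0, 1]; 0 if both sets are empty.
    public static double overlap(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a, n));
        common.retainAll(ngrams(b, n));
        Set<String> union = new HashSet<>(ngrams(a, n));
        union.addAll(ngrams(b, n));
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    // Plain Levenshtein distance (insert, delete, substitute; no transpositions).
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String other : new String[] {"research", "saerch"}) {
            System.out.println("search vs " + other
                + ": edits=" + editDistance("search", other)
                + ", trigram overlap=" + overlap("search", other, 3));
        }
    }
}
```

Both "research" and "saerch" are two plain Levenshtein edits away from "search", yet "research" shares four of the six distinct trigrams involved while "saerch" shares only one of seven. An n-gram index would therefore rank these two candidates very differently even though an edit-distance query treats them the same.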
