Elasticsearch fuzzy query - max edits doesn't work as expected
I recently added the fuzzy operator and the fuzzy query settings to our search query string so that it tolerates user typos (for example "zamestnanost" vs. "zamestnani"):
POST /my_index/_search
{
  "query": {
    "query_string": {
      "query": "+(content:zamestnanost~)",
      "fuzzy_prefix_length": 3,
      "fuzzy_min_sim": 0.5,
      "fuzzy_max_expansions": 50
    }
  }
}
As I understand the fuzzy query settings, fuzzy_min_sim = 0.5 should allow length(query) * 0.5 edits to the original query (6 edits in this case).
However, it does not even match "closer" words (tokens) such as
- "zamestnani"
- "zamestnany"
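I have a strange feeling that it still only matches words in the index that are at most 2 edits away from the original query string (which is the default edit count for a fuzzy query).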
I also ran explain on my query, and I think the result supports this hypothesis. The _explanation looks like this:
"_explanation": {
  "value": 0.057083897,
  "description": "sum of:",
  "details": [
    {
      "value": 0.023866946,
      "description": "weight(content:zamestnano^0.8 in 0) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 0.023866946,
          "description": "score(doc=0,freq=4.0), product of:",
          "details": [
            {
              "value": 0.66062796,
              "description": "queryWeight, product of:",
              "details": [
                {
                  "value": 0.8,
                  "description": "boost"
                },
                {
                  "value": 4.624341,
                  "description": "idf(docFreq=1, maxDocs=75)"
                },
                {
                  "value": 0.17857353,
                  "description": "queryNorm"
                }
              ]
            },
            {
              "value": 0.036127664,
              "description": "fieldWeight in 0, product of:",
              "details": [
                {
                  "value": 2,
                  "description": "tf(freq=4.0), with freq of:",
                  "details": [
                    {
                      "value": 4,
                      "description": "termFreq=4.0"
                    }
                  ]
                },
                {
                  "value": 4.624341,
                  "description": "idf(docFreq=1, maxDocs=75)"
                },
                {
                  "value": 0.00390625,
                  "description": "fieldNorm(doc=0)"
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 0.03321695,
      "description": "weight(content:zamestnanos^0.9090909 in 0) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 0.03321695,
          "description": "score(doc=0,freq=6.0), product of:",
          "details": [
            {
              "value": 0.7507135,
              "description": "queryWeight, product of:",
              "details": [
                {
                  "value": 0.9090909,
                  "description": "boost"
                },
                {
                  "value": 4.624341,
                  "description": "idf(docFreq=1, maxDocs=75)"
                },
                {
                  "value": 0.17857353,
                  "description": "queryNorm"
                }
              ]
            },
            {
              "value": 0.044247173,
              "description": "fieldWeight in 0, product of:",
              "details": [
                {
                  "value": 2.4494898,
                  "description": "tf(freq=6.0), with freq of:",
                  "details": [
                    {
                      "value": 6,
                      "description": "termFreq=6.0"
                    }
                  ]
                },
                {
                  "value": 4.624341,
                  "description": "idf(docFreq=1, maxDocs=75)"
                },
                {
                  "value": 0.00390625,
                  "description": "fieldNorm(doc=0)"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
Only the query terms "zamestnano" and "zamestnanos" were generated by the fuzzy query edits.
Is my understanding of the fuzzy query settings correct? Can you point out my mistake?
Thanks a lot for any ideas!
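The Elasticsearch documentation on fuzziness says the following about values in the range 0.0..1.0:
[1.7.0] Deprecated in 1.7.0. Support for similarity will be removed in Elasticsearch 2.0. Converted into an edit distance using the formula: length(term) * (1.0 - fuzziness), e.g. a fuzziness of 0.6 with a term of length 10 would result in an edit distance of 4. Note: in all APIs except for the Fuzzy Like This Query, the maximum allowed edit distance is 2.
So for the 12-character term "zamestnanost" and fuzzy_min_sim = 0.5 the formula does give 6, but the result is then capped at 2 edits.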
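The conversion in the quoted formula, followed by the cap from the note, can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Elasticsearch code: the function name similarity_to_edits and the cap parameter are made up for illustration, and truncating with int() is an assumption, since the quoted text gives the formula but does not say how fractions are rounded.

```python
MAX_EDITS = 2  # cap from the quoted note (all APIs except Fuzzy Like This)


def similarity_to_edits(term: str, fuzziness: float, cap: int = MAX_EDITS) -> int:
    """Turn a 0.0..1.0 similarity into an edit distance, then apply the cap."""
    # Formula from the quoted docs: length(term) * (1.0 - fuzziness).
    # Truncation via int() is an assumption made for this sketch.
    raw = int(len(term) * (1.0 - fuzziness))
    return min(raw, cap)


term = "zamestnanost"
print(len(term))                                 # 12
print(similarity_to_edits(term, 0.5, cap=100))   # 6, the value the question expected
print(similarity_to_edits(term, 0.5))            # 2, the value after the cap
```

With the cap in place, lowering fuzzy_min_sim below roughly 0.84 makes no further difference for a 12-character term, because the formula already yields 2 or more edits at that point.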
The easiest way to double-check is to use the validate API:
GET _validate/query?explain&index=my_index
{
  "query": {
    "query_string": {
      "query": "+(content:zamestnanost~)",
      "fuzzy_prefix_length": 3,
      "fuzzy_min_sim": 0.5,
      "fuzzy_max_expansions": 50
    }
  }
}
The result is:
"explanations": [
  {
    "index": "test",
    "valid": true,
    "explanation": "+content:zamestnanost~2"
  }
]
This shows the actual edit distance ES will use in the query: zamestnanost~2.
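To see why a limit of 2 edits rules out "zamestnani" and "zamestnany" while still admitting the two terms in the explain output, the edit distances can be computed directly. The sketch below uses plain Levenshtein distance (insert, delete, substitute); it is an illustration written for this answer, not what Elasticsearch runs internally. The implied_boost helper is a further assumption: the formula 1 - edits / min(len(query), len(term)) is not stated in the question or the quoted documentation, it is only shown here because it happens to reproduce the boosts 0.8 and 0.9090909 from the explain output.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def implied_boost(query: str, term: str) -> float:
    """Assumed boost formula; only checked against the explain output above."""
    return 1 - levenshtein(query, term) / min(len(query), len(term))


query = "zamestnanost"
candidates = ["zamestnanos", "zamestnano", "zamestnani", "zamestnany"]
distances = {term: levenshtein(query, term) for term in candidates}

for term, d in distances.items():
    verdict = "within" if d <= 2 else "beyond"
    print(f"{term}: {d} edits ({verdict} the limit of 2)")
# zamestnanos: 1 edits (within the limit of 2)
# zamestnano: 2 edits (within the limit of 2)
# zamestnani: 3 edits (beyond the limit of 2)
# zamestnany: 3 edits (beyond the limit of 2)
```

Both words the question expected to match are 3 edits away from "zamestnanost", so they fall just outside what zamestnanost~2 can reach.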