Strange behavior using the wildcard character for certain keywords

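In my Azure Cognitive Search index, a search for the term "education" returns 660 hits. A search for a close variant of that word also returns 660 hits. Both queries appear to return the same results, which contain both variants of the word.

However, I see very strange behavior when I use wildcards: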

edu* returns 660 results (expected)
educ* returns 660 results (expected)
educa* returns 2 results (matches two instances of the hyphenated word "educa-tion")
educat* returns 0 results (unexpected)
educati* returns 0 results (unexpected)
educatio* returns 0 results (unexpected)

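Every searchable field uses the English Lucene language analyzer, queryType is set to "full", and searchMode is set to "all".

Why do the last three queries return nothing?

As an aside, I have found conflicting information about using a wildcard at the beginning of a word.

The Lucene documentation says: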

Note: You cannot use a * or ? symbol as the first character of a search.

From: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

But Microsoft's documentation seems to imply that it should work:

Term fragment comes after * or ?, with a forward slash to delimit the construct. For example, search=/.*numeric.*/ returns "alphanumeric".

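From: https://docs.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_wildcard

I have tried *ducation (which returns an error) and /.*ducation./ (which returns 0 results).

Thanks for your help.

When you use the English Lucene analyzer, your content is stemmed aggressively. This is explained in the "Impact of an analyzer on wildcard queries" section of the link you provided. If you switch to the Microsoft English analyzer, your examples should work as expected.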

https://docs.microsoft.com/en-us/azure/search/query-lucene-syntax#impact-of-an-analyzer-on-wildcard-queries

If you were to use the en.lucene (English Lucene) analyzer, it would apply aggressive stemming of each term. For example, 'terminate', 'termination', 'terminates' will all be tokenized down to the token 'termi' in your index. On the other side, terms in queries using wildcards or fuzzy search are not analyzed at all, so there would be no results that would match the 'terminat*' query.
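To make the mismatch concrete, here is a toy model in Python of a prefix query running against the tokens stored in the index. It is only a sketch: the token sets are assumptions chosen to mirror the numbers in the question ("education" stored as the stem "educ", the hyphenated "educa-tion" split into "educa" and "tion"), and `prefix_matches` is a hypothetical helper, not part of any Azure or Lucene API.

```python
# Toy inverted index: just the set of tokens the analyzer is assumed to store.
# These token values are illustrative assumptions, not real analyzer output.
STEMMED_INDEX = {"educ", "educa", "tion"}          # assumed en.lucene-style stems
WHOLE_WORD_INDEX = {"education", "educa", "tion"}  # assumed whole-word tokens


def prefix_matches(index_tokens, query):
    """Return the sorted index tokens matched by a prefix query such as 'educ*'."""
    prefix = query.rstrip("*")  # the wildcard term is used as typed, with no stemming
    return sorted(token for token in index_tokens if token.startswith(prefix))


for query in ["edu*", "educ*", "educa*", "educat*"]:
    print(query, prefix_matches(STEMMED_INDEX, query), prefix_matches(WHOLE_WORD_INDEX, query))
```

In this model "educat*" finds nothing in the stemmed set because no stored token is that long, while the same prefix still finds "education" when whole words are stored.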

On the other side, the Microsoft analyzers (in this case, the en.microsoft analyzer) are a bit more advanced and use lemmatization instead of stemming. This means that all generated tokens should be valid English words. For example, 'terminate', 'terminates' and 'termination' will mostly stay whole in the index, and would be a preferable choice for scenarios that depend a lot on wildcards and fuzzy search.
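If you want to check which tokens each analyzer actually produces for a word, the Analyze Text REST API can tell you. Below is a minimal Python sketch that only builds the request for each analyzer and prints it. The service name "my-service", the index name "my-index", and the helper `build_analyze_request` are placeholders I made up for illustration, and the api-version shown is an assumption you may need to adjust. Actually sending the request also requires an `api-key` header and `Content-Type: application/json`.

```python
import json


def build_analyze_request(service, index, text, analyzer, api_version="2020-06-30"):
    """Build the URL and JSON body for an Analyze Text call (nothing is sent)."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}/analyze"
        f"?api-version={api_version}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer})
    return url, body


# "my-service" and "my-index" are placeholder names, not real resources.
for analyzer in ("en.lucene", "en.microsoft"):
    url, body = build_analyze_request("my-service", "my-index", "education", analyzer)
    print("POST", url)
    print(body)
```

Comparing the tokens returned for the two analyzers should show whether a given prefix such as educat* has anything to match in the index.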