Do stemming and fuzzy search work together in Apache Solr?
I am using the Porter filter factory on a field that contains 3 to 4 words.
For example: "ABC BLOSSOM COMPANY"
I expect to retrieve the above document when I search for ABC BLOSSOMING COMPANY as well.
When I query:
name:ABC AND name:BLOSSOMING AND name:COMPANY
I get my result.
This is what the parsed query looks like:
+name:southern +name:blossom +name:compani
(Stemmer works fine)
But when I add the fuzzy syntax and query like this,
name:ABC~1 AND name:BLOSSOMING~1 AND name:COMPANY~1
the search returns no documents, and the parsed query looks like this:
+name:abc~1 +name:blossoming~1 +name:company~2
This clearly shows that stemming is not happening.
Please comment and provide feedback.
TL;DR
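No, stemming does not happen here, because you are using the Porter stem filter, which is not a MultiTermAwareComponent.
What to do?
Use one of the filters/normalizers that implement the MultiTermAwareComponent interface.
Explanation
You, like many others, have been bitten by the multiterm behavior of Solr and Lucene. There is a good article about this topic on the Solr wiki. Although the article is dated, it still applies: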
One of the surprises for most Solr users is that wildcards queries haven't gone through any analysis. Practically, this means that wildcard (and prefix and range) queries are case sensitive, which is at odds with expectations. As of this SOLR-2438, SOLR-2918, and perhaps SOLR-2921, this behavior is changed.
What's a multiterm you ask? Essentially it's any term that may "point to" more than one real term. For instance, run* could expand to runs, runner, running, runt, etc. Likewise, a range query is really a "multiterm" query as well. Before Solr 3.6, these were completely unprocessed, the application layer usually had to apply any transformations required, for instance lower-casing the input. Running these types of terms through a "normal" query analysis chain leads to all sorts of interesting behavior so was avoided.
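To make the effect on the query from the question concrete, here is a minimal Python sketch. It assumes the index holds the Porter-stemmed, lower-cased terms (`abc`, `blossom`, `compani`) and that the fuzzy terms reach the index only lower-cased, as the parsed query in the question suggests. The `levenshtein` and `fuzzy_matches` helpers are illustrative names, not part of Solr, and plain Levenshtein distance is used as a stand-in for Lucene's fuzzy matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute) using a two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_matches(pairs, max_edits):
    """Map each unstemmed query term to whether it is within max_edits of its indexed term."""
    return {query: levenshtein(query, indexed) <= max_edits
            for query, indexed in pairs}


# (fuzzy query term, term assumed to be in the index after Porter stemming)
pairs = [("abc", "abc"), ("blossoming", "blossom"), ("company", "compani")]
MAX_EDITS = 1  # the ~1 from the question

result = fuzzy_matches(pairs, MAX_EDITS)
for term, ok in result.items():
    print(term, "matches" if ok else "does not match")
```

Under these assumptions `blossoming` is three edits away from `blossom`, so `name:BLOSSOMING~1` cannot match, and because the three clauses are joined with AND the whole query returns nothing.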
Well, here is the configuration that helped me while experimenting:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
(Yes, I modified the existing "text_general" field type; as I said, I was experimenting.)
Using it with a fuzzy edit distance of 2 on the term "neglect" produced the following results:
1. Lost in Translation - A faded movie star and a neglected young woman...
2. Election - A high school teacher meets his match in an over-achieving...
3. Annie Hall - Alvy Singer, a divorced Jewish comedian, reflects on his relationship...
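This is reasonably good, since the first result is appropriate.
However, if I search for "rescuing" with fuzzy search enabled, it returns nothing. With fuzzy disabled, the results are: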
1. The Searchers - ... a years-long journey to rescue his niece from ...
2. Star Wars - ...while also attempting to rescue Princess Leia from...
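One plausible reading of both result sets can be checked by hand. The sketch below is an illustration, not Solr's own code: it assumes the first query term was `neglect`, that the fuzzy term is compared unstemmed against the index, and that Porter stemming reduced the indexed words to `rescu` (rescue), `neglect` (neglected), `elect` (Election) and `reflect` (reflects). The `edit_distance` helper is a hypothetical name, and plain Levenshtein distance stands in for Lucene's fuzzy matching.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via memoized recursion over suffixes of a and b."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == len(a):           # a exhausted: insert the rest of b
            return len(b) - j
        if j == len(b):           # b exhausted: delete the rest of a
            return len(a) - i
        if a[i] == b[j]:          # characters agree: no edit needed
            return d(i + 1, j + 1)
        return 1 + min(d(i + 1, j),      # delete a[i]
                       d(i, j + 1),      # insert b[j]
                       d(i + 1, j + 1))  # substitute a[i] with b[j]
    return d(0, 0)


MAX_EDITS = 2

# (fuzzy query term, indexed stem assumed from the result snippets)
candidates = [
    ("rescuing", "rescu"),
    ("neglect", "neglect"),
    ("neglect", "elect"),
    ("neglect", "reflect"),
]

distances = {pair: edit_distance(*pair) for pair in candidates}
for (query, stem), dist in distances.items():
    verdict = "match" if dist <= MAX_EDITS else "no match"
    print(f"{query} vs {stem}: {dist} edits, {verdict}")
```

If those assumptions hold, `rescuing` sits three edits away from the stem `rescu` and falls outside the limit of 2, while `neglect` is within two edits of `elect` and `reflect`, which would account for the two unrelated hits.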
So the results of fuzzy plus stemming are rather inconsistent. Elasticsearch, which like Solr is based on Lucene, recommends against using fuzzy with stemming:
This also means that if using say, a snowball analyzer, a fuzzy search for 'running', will be stemmed to 'run', but will not match the misspelled word 'runninga', which stems to 'runninga', because 'run' is more than 2 edits away from 'runninga'. This can cause quite a bit of confusion, and for this reason, it often makes sense only to use the simple analyzer on text intended for use with fuzzy queries, possibly disabling synonyms as well.
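The `run` versus `runninga` case in the quote can be ruled out without computing a full edit distance, because the length difference between two strings is a lower bound on the number of edits between them. The following short Python sketch shows that check; `can_match_by_length` is a hypothetical helper name, not a Lucene or Elasticsearch function.

```python
def can_match_by_length(query_term: str, indexed_term: str, max_edits: int) -> bool:
    """Necessary (not sufficient) condition for a fuzzy match.

    Each edit changes the length by at most one, so two terms whose
    lengths differ by more than max_edits can never match.
    """
    return abs(len(query_term) - len(indexed_term)) <= max_edits


MAX_EDITS = 2

# 'running' is stemmed to 'run' at query time; the misspelling stays 'runninga' in the index.
stemmed_vs_misspelled = can_match_by_length("run", "runninga", MAX_EDITS)
# Without stemming, the original term is only one edit away from the misspelling.
unstemmed_vs_misspelled = can_match_by_length("running", "runninga", MAX_EDITS)

print(stemmed_vs_misspelled)    # False
print(unstemmed_vs_misspelled)  # True
```

A `True` here only means a match is not excluded by length alone; the real distance still has to be within the limit.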