Multi-term Solr synonym issue
I have defined the following synonyms:
facebook,fb,face book, face bk
Now when I search for facebook, the parsed query is
<str name="parsedquery_toString">
text:facebook text:fb text:face text:face text:book text:bk
</str>
But if I search for face book, the parsed query is
<str name="parsedquery_toString">
text:face text:book
</str>
Shouldn't the parsed query be the same for both keywords?
Here is my configuration snippet:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
Here are the contents of synonyms.txt:
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
facebook,fb,face book, face bk
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
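This is a well-known problem in Solr/Lucene: the query parser splits the input on whitespace before analysis, so the synonym filter sees face and book as separate tokens and never matches the multi-word entry. You can find more information here:
- the Lucene ticket
- this blog post; see the section titled "Multi-word synonyms won't be matched in queries"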
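If you want to work around it, you have a few options:
- Apply one of the several plugins/parsers mentioned in the two resources above. The downside is that you have to redo this work every time you upgrade Solr.
- Move the synonyms to index time. This is the preferred approach anyway, although it has its own drawbacks.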
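The second option could look roughly like the sketch below. It assumes the same text_en field type, filter attributes, and Solr version as in the question. The only change is that the SynonymFilterFactory line moves from the query analyzer to the index analyzer. Placing it right after the tokenizer mirrors the question's query chain and is an assumption, not a verified recommendation. All documents have to be reindexed after the change.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: synonyms are expanded here, at index time, instead of at query time -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- No synonym filter in the query chain: each whitespace-separated term is analyzed on its own -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, a document containing facebook would also be indexed under fb, face, book and bk, so a query for face book (parsed as text:face text:book) can match it. The drawbacks include a larger index and the need to reindex whenever synonyms.txt changes.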