Searching Solr index for concatenated words
I am struggling with two similar use cases.
Here is a sample document from my index:
{
"id":"E850AC8D844010AFA76203B390DD3135",
"brand_txt_en":"Tom Ford",
"catch_all":["Tom Ford",
"FT 5163",
"Tom Ford",
"FT 5163",
"DARK HAVANA"],
"model_txt_en":"FT 5163",
"brand_txt_en_split":"Tom Ford",
"model_txt_en_split":"FT 5163",
"color_txt_en":"DARK HAVANA",
"material_s":"acetato",
"gender_s":"uomo",
"shape_s":"Wayfarer",
"lens_s":"cerchiata",
"modelkey_s":"86_1_FT 5163",
"sales_i":0,
"brand_s":"Tom Ford",
"model_s":"FT 5163",
"color_s":"DARK HAVANA",
"_version_":1569456572504997895
}
Query: brand_txt_en_split:tomford
No results!
The field type is Solr's default text_en_splitting:
<fieldType name="text_en_splitting" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I expected WordDelimiterFilterFactory to generate a "tomford" token by concatenating the words, but it does not seem to work that way.
The 'inverse' use case is:
{
... "model_txt_en_split": "The Clubmaster", ...
}
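I would like to find this document with the following query:
club master
I think I should use an EdgeNGram filter for the latter case, but I don't really know how to do it.
Thanks for your help.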
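WordDelimiterFilterFactory has catenateWords and catenateAll. Both operate within a single token only, joining the parts that the filter itself has split out. Since WhitespaceTokenizerFactory has already turned "Tom Ford" into the separate tokens "Tom" and "Ford", the filter never sees them together and cannot produce "tomford". From the documentation: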
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" -> "hotspotsensor"
catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" -> "ZapMaster9000"
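To remove the spaces between words, try the following filter. Note that it can only match whitespace that is still inside a token, so it needs a tokenizer that does not split on whitespace (for example KeywordTokenizerFactory).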
<filter class="solr.PatternReplaceFilterFactory" pattern="(\s+)" replacement="" replace="all" />
Once you have added this to schema.xml, restart the server and reindex the data.
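As a rough sketch of how that filter could be wired in, the fragment below keeps the whole field value as one token and then strips the whitespace from it. The field type name text_nospace and the field name brand_nospace are made up for illustration, and copying from brand_s is only an assumption about how your schema is populated.

```xml
<!-- Hypothetical field type: the whole value becomes one lowercased token without whitespace -->
<fieldType name="text_nospace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keep "Tom Ford" as a single token so the next filter can see the space -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- "Tom Ford" becomes "TomFord" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\s+)" replacement="" replace="all"/>
    <!-- "TomFord" becomes "tomford" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field, filled from the existing brand_s field (assumption) -->
<field name="brand_nospace" type="text_nospace" indexed="true" stored="false"/>
<copyField source="brand_s" dest="brand_nospace"/>
```

With something like this in place, brand_nospace:tomford should match the sample document. A query value that contains a space would probably need to be quoted, e.g. brand_nospace:"tom ford", so that the whole value reaches the analyzer as one string.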
For the second case, you can try the following analyzer in the field type of your field.
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="25"/>
</analyzer>
Input string: "John Oliver W Clane"
Tokenizer to filter: "John Oliver W Clane"
Output tokens:
"John", "John ", "John O", "John Ol", "John Oli", "John Oliv", "John Olive", "John Oliver", "John Oliver ", "John Oliver W", "John Oliver W ",
"John Oliver W C", "John Oliver W Cl", "John Oliver W Cla", "John Oliver W Clan", "John Oliver W Clane".
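A common arrangement, sketched here under assumptions, is to generate the edge n-grams at index time only and leave the query string whole, so that a typed prefix is compared against the stored grams instead of being expanded itself. The type name text_edge_prefix is hypothetical, and the lowercase filter is an addition that is not part of the analyzer above.

```xml
<!-- Hypothetical field type for prefix matching on the whole field value -->
<fieldType name="text_edge_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Added for case-insensitive matching (assumption, not in the original analyzer) -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every prefix of the value that is 4 to 25 characters long -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams here: the query string is matched as a single lowercased token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the example value above, a quoted query such as "john oli" on a field of this type would be expected to match, while anything shorter than four characters would not, because no gram that short is indexed.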
Alternatively, you can try this filter, which emits substrings from every position rather than only prefixes.
<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="25"/>
You can read more about analyzers and filters in the Solr documentation, "Solr Analyzers and Filters".
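For the inverse case ("The Clubmaster" found by club master), one possible way to use the NGram filter is to build the grams at index time and split the query on whitespace. This is a sketch, not a tested configuration: the type name text_ngram is hypothetical, the gram sizes are simply carried over from the filter above, and the lowercase filters are an added assumption.

```xml
<!-- Hypothetical field type: substrings at index time, whole words at query time -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- "The Clubmaster" stays one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits all substrings of 4 to 25 characters, including "club" and "master" -->
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- "club master" becomes the two terms "club" and "master" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is index size, since every value produces many grams, and query words shorter than minGramSize (such as "FT") will not match on such a field.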