Relevant query suggestions for autocomplete with Solr
I am using Solr 6.4 with Haystack 2.6.1 and pySolr 3.6.
I am looking for Google-like autocomplete suggestions. EdgeNGram actually works well, but it returns only my document titles, which is not what I want:
Example:
typing: 'new y'
return:
New york, fabulous city that never sleep
A trip to new york by night
...
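This only lets the user select a specific document from the suggestion list, and the search then returns just the document whose title was suggested.
What I want is suggestions of relevant terms, for example: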
typing: 'new y'
return:
new york
new york by night
new york city
trip to new york
There is an article that suggests indexing the user queries that returned results, then using those queries as suggestions:
https://lucidworks.com/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
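That means either parsing the Solr logs or using a Data Import Handler (DIH) over a pile of saved user queries in a database.
That article is also quite old (2009), and since then Solr has given us the Suggester (https://cwiki.apache.org/confluence/display/solr/Suggester).
In any case, I would like to know whether there is a good tutorial on using the Suggester to return relevant queries rather than my document titles, without having to save user queries in a database, import them through a scheduled process, rebuild the index, and so on.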
My search_indexes.py
from haystack import indexes
# Article is the project's Django model; import it from your own app.

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    created = indexes.DateTimeField(model_attr='created')
    rating = indexes.IntegerField(model_attr='rating')
    title = indexes.CharField(model_attr='title', boost=1.125)
    term = indexes.EdgeNgramField(model_attr='title')

    def get_model(self):
        return Article
My article_text.txt
{{ object.title }}
{{ object.created }}
{{ object.rating }}
My schema.xml
<field name="term" type="text_general" indexed="true" stored="true" />
<field name="weight" type="float" indexed="true" stored="true" />
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front" />
</analyzer>
</fieldType>
<fieldType name="suggestType" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9]" replacement=" " />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My solrconfig.xml
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">infixSuggester</str>
<str name="suggest.onlyMorePopular">true</str>
<str name="suggest.count">10</str>
<str name="suggest.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">infixSuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="indexPath">infix_suggestions</str>
<str name="highlight">false</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">term</str>
<str name="weightField">weight</str>
<str name="suggestAnalyzerFieldType">suggestType</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
</lst>
</searchComponent>
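I query Solr with pysolr directly, because Haystack does not implement a suggest method yet:
from django.conf import settings
from pysolr import Solr
# Point pysolr at the /suggest handler; query_string is what the user has typed so far.
solr = Solr(settings.HAYSTACK_CONNECTIONS['default']['URL'], search_handler='/suggest', use_qt_param=False)
raw_results = solr.search('', **{'suggest.q': query_string})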
Given your needs, I would suggest using BlendedInfixLookupFactory, set up as follows.
In schema.xml, create a field that the suggester will use, and copy the title into it:
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="term_suggest" type="phrase_suggest" indexed="true" stored="true" multiValued="true"/>
<copyField source="title" dest="term_suggest"/>
<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Then, in solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggest</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="blenderType">linear</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">term_suggest</str>
<str name="weightField">weight</str>
<str name="suggestAnalyzerFieldType">text_suggest</str>
<str name="queryAnalyzerFieldType">phrase_suggest</str>
<str name="indexPath">suggest</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
<bool name="exactMatchFirst">true</bool>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">false</str>
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
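With BlendedInfixLookupFactory, "new y" is matched anywhere in the field, and matches that occur closer to the start are given more weight. Combining the standard tokenizer for suggestAnalyzerFieldType with the keyword tokenizer for queryAnalyzerFieldType lets you search with spaces: the query "new y" is read as a single string (one keyword) instead of being split into words.
The Confluence wiki link you posted is a good one; it was last modified in September 2016.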
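To make the weighting concrete, here is a rough Python sketch of the two blender formulas listed in the Solr reference guide: `linear` is weight * (1 - 0.10 * position) and `reciprocal` is weight / (1 + position). The helper `first_match_position` and the weight of 10.0 are assumptions for illustration only; the real suggester works on analyzed tokens inside Lucene, not on a plain whitespace split.

```python
def linear_blend(weight, position):
    """'linear' blender: weight * (1 - 0.10 * position)."""
    return weight * (1 - 0.10 * position)


def reciprocal_blend(weight, position):
    """'reciprocal' blender: weight / (1 + position)."""
    return weight / (1 + position)


def first_match_position(title, prefix):
    """Hypothetical helper: index of the first whitespace-separated token
    that starts with `prefix`, or None when nothing matches."""
    for index, token in enumerate(title.lower().split()):
        if token.startswith(prefix.lower()):
            return index
    return None


titles = [
    "New york, fabulous city that never sleep",
    "A trip to new york by night",
]

for title in titles:
    position = first_match_position(title, "new")
    # The same stored weight (10.0, made up) shrinks as the match moves right.
    print(title, position, linear_blend(10.0, position), reciprocal_blend(10.0, position))
```

A title where "new" is the first token keeps its full weight, while a title where it is the fourth token is scored lower, which is why titles starting with the typed text tend to come first.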
Edit:
I did not realize you did not want the whole title. You could try shingles for this, by changing the phrase_suggest fieldType in the schema above to:
<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2"
maxShingleSize="4"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"/>
</analyzer>
</fieldType>
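To see why shingles yield phrases such as "new york by night" rather than whole titles, the following Python sketch mimics what a shingle filter with minShingleSize=2, maxShingleSize=4 and outputUnigrams=true emits. It is only an approximation: the function name `shingles` is made up here, tokenization is a plain whitespace split, and outputUnigramsIfNoShingles is not modeled.

```python
def shingles(tokens, min_size=2, max_size=4, output_unigrams=True):
    """Return word n-grams (shingles) of min_size..max_size tokens,
    optionally preceded by the single tokens themselves."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            # Only emit a shingle when enough tokens remain.
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


tokens = "a trip to new york by night".lower().split()
for shingle in shingles(tokens):
    print(shingle)
```

Each printed line is a candidate suggestion, so a title contributes short phrases like "new york" and "trip to new york" in addition to its individual words.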
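Edit 2:
Alternatively, you could define phrase_suggest with the standard tokenizer plus a shingle filter for the index analyzer, and the keyword tokenizer for the query analyzer: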
<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2"
maxShingleSize="4"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Then, in the suggest searchComponent, you only need:
<str name="suggestAnalyzerFieldType">phrase_suggest</str>
(and no queryAnalyzerFieldType). Of course, you will need to adjust the ShingleFilterFactory settings to suit your needs.
After several hours of struggling, I finally got something working. Not perfect, but good enough.
Based on this article:
http://alexbenedetti.blogspot.fr/2015/07/solr-you-complete-me.html
I used FreeTextLookupFactory.
My search_indexes.py
from haystack import indexes
# Article is the project's Django model; import it from your own app.

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    created = indexes.DateTimeField(model_attr='created')
    rating = indexes.IntegerField(model_attr='rating')
    title = indexes.CharField(model_attr='title', boost=1.125)

    def get_model(self):
        return Article
My schema.xml
<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="rating" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="title" type="text_en" indexed="true" stored="true" multiValued="false"/>
<field name="created" type="date" indexed="true" stored="true" multiValued="false"/>
My solrconfig.xml
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggest</str>
<str name="lookupImpl">FreeTextLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="ngrams">3</str>
<float name="threshold">0.004</float>
<str name="highlight">false</str>
<str name="buildOnCommit">false</str>
<str name="separator"> </str>
<str name="suggestFreeTextAnalyzerFieldType">text_general</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest.dictionary">suggest</str>
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
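Because I am using Solr 6.4, which runs in managed schema mode by default (so my changes to schema.xml were being ignored), I had to switch to a manually edited schema by adding this to solrconfig.xml: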
<schemaFactory class="ClassicIndexSchemaFactory"/>
Then restart Solr and rebuild the index with Haystack's rebuild_index command.
And of course, build the suggester with curl:
curl "http://127.0.0.1:8983/solr/collection1/suggest?suggest.build=true"
The final result can be checked with:
curl "http://127.0.0.1:8983/solr/collection1/suggest?suggest.q=new%20y"
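For use from Python instead of curl, a minimal sketch might look like the following. It assumes the JSON layout the SuggestComponent normally returns (suggest, then dictionary name, then the typed query, then a suggestions list). The helper names `suggest_url` and `extract_terms` are made up, and the `sample` payload is a hand-written illustration, not real output from this index.

```python
from urllib.parse import urlencode


def suggest_url(core_url, query, build=False):
    """Build the /suggest URL for a core; wt=json is added explicitly
    because the handler above does not set a response format."""
    params = {"suggest.q": query, "wt": "json"}
    if build:
        params["suggest.build"] = "true"
    return core_url.rstrip("/") + "/suggest?" + urlencode(params)


def extract_terms(response, dictionary, query):
    """Pull the suggested terms out of a decoded JSON suggest response."""
    entry = response.get("suggest", {}).get(dictionary, {}).get(query, {})
    return [suggestion["term"] for suggestion in entry.get("suggestions", [])]


# Hand-written sample shaped like a suggest response (illustrative only).
sample = {
    "responseHeader": {"status": 0, "QTime": 1},
    "suggest": {
        "suggest": {
            "new y": {
                "numFound": 2,
                "suggestions": [
                    {"term": "new york", "weight": 0, "payload": ""},
                    {"term": "new york city", "weight": 0, "payload": ""},
                ],
            }
        }
    },
}

print(suggest_url("http://127.0.0.1:8983/solr/collection1", "new y"))
print(extract_terms(sample, "suggest", "new y"))
```

The second argument to `extract_terms` is the suggester name from solrconfig.xml ("suggest" in this configuration), and the third must be the exact text sent as suggest.q.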
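I will try to dig deeper into FreeTextLookupFactory to see whether I can make it more accurate, but it is already satisfactory.
Hope this helps.
PS: always keep an eye on the logs: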
http://127.0.0.1:8983/solr/#/~logging
I strongly recommend keeping it open in a tab at all times. It saved me hours of pain...