Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter

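I'm new to Solr and am trying to provide partial-word matching with Solr 8.8.1, but partial matches return no results. I've combed through blogs without any luck in resolving this.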

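For example, a document contains the word `longer` in its body. Index analysis yields `lon`, `long`, `longe`, `longer`. If I query for `longer` using `alltext_en:longer`, I get a match. However, if I query for (e.g.) `longe` using `alltext_en:longe`, I get no match. `explainOther` returns `0.0 = No matching clauses`.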

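It seems I'm missing something obvious, since this isn't a complicated phrase query.

Apologies in advance if I've left out any needed details - tell me what else you need to know and I'll update the question.

Here are the relevant field specifications from my managed schema: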

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>

  <field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
  <copyField source="*_txt_en" dest="alltext_en"/>

Here is the relevant part of solrconfig.xml:

  <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>

       <!-- Query settings -->
       <str name="defType">edismax</str>
       <str name="q">*:*</str>
       <str name="q.alt">*:*</str>
       <str name="rows">50</str>
       <str name="fl">*,score,[explain]</str>
       <str name="ps">10</str>

       <!-- Highlighting defaults -->
       <str name="hl">on</str>
       <str name="hl.fl">_text_</str>
       <str name="hl.preserveMulti">true</str>
       <str name="hl.encoder">html</str>
       <str name="hl.simple.pre">&lt;span class="artica-snippet"&gt;</str>
       <str name="hl.simple.post">&lt;/span&gt;</str>

       <!-- Spell checking defaults -->
       <str name="spellcheck">on</str>
       <str name="spellcheck.extendedResults">false</str>
       <str name="spellcheck.count">5</str>
       <str name="spellcheck.alternativeTermCount">2</str>
       <str name="spellcheck.maxResultsForSuggest">5</str>
       <str name="spellcheck.collate">true</str>
       <str name="spellcheck.collateExtendedResults">true</str>
       <str name="spellcheck.maxCollationTries">5</str>
       <str name="spellcheck.maxCollations">3</str>
     </lst>

     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
  </requestHandler>

That stemming filter will modify the tokens in ways you might not predict, and since at query time the stemmed token is what gets matched against the ngrammed tokens, it might not be what you expect. If you're generating ngrams, stemming filters should usually be removed. I'd also remove the possessive filter. (Also, a small note: try to avoid using * when formatting text, since it's hard to tell whether you used it in the query or whether it's a formatting error; use backticks instead to indicate that the text is a code keyword/query.) – MatsLindh

That was the answer - I removed the stemmer from the index step and all is well. Awesome, thank you, @MatsLindh!
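
For anyone landing here later, this is a rough sketch of what the revised field type might look like. It is reconstructed from the question's schema plus the fix described above, not copied from the asker's final config. It assumes only the `PorterStemFilterFactory` line was dropped from the index analyzer and that everything else stayed as posted.

```xml
<!-- Sketch only: text_en from the question with the stemmer removed
     from the index analyzer. All other settings are as originally posted. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- PorterStemFilterFactory removed here, so the edge n-grams are
         built from the unstemmed token -->
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- Assumption: left as posted. The comment above suggests removing
         stemming (and the possessive filter) when using n-grams, so this
         line may be worth dropping as well. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Since this changes index-time analysis, existing documents would need to be reindexed before the new n-grams show up in search.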