Solr Queries: Single Terms versus Phrases

Question

In our Solr-based search, we started out using phrases. For example, when the user types

blue dress

the Solr query is

title:"blue dress" OR description:"blue dress"

We now want to remove stop words. With the default StopFilterFactory, the query

the blue dress

matches documents containing "blue dress" or "the blue dress".

However, entering

blue the dress

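does not match documents containing "blue dress".

I am starting to wonder whether we should search using single terms only, that is, convert the user search above into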

title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress

I am a bit reluctant to do this, though, because it seems to duplicate the work of StandardTokenizerFactory.

Here is my schema.xml:
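To make the single-term idea concrete, here is a minimal sketch of what such a client-side conversion could look like. It assumes a naive whitespace split and does no escaping of Solr special characters; the class and method names are hypothetical and not part of any Solr API.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleTermQueryBuilder {

    // Builds "field:word OR field:word ..." for every field and every
    // whitespace-separated word of the user input (hypothetical helper).
    public static String build(String userInput, String... fields) {
        String[] words = userInput.trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            for (String word : words) {
                if (!word.isEmpty()) {
                    clauses.add(field + ":" + word);
                }
            }
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("the blue dress", "title", "description"));
    }
}
```

The whitespace split is exactly the part that mimics the tokenizer on the client side, and it will disagree with StandardTokenizerFactory on punctuation and similar cases.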

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>

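The title and description fields are both of type text_general.

Is single-term search the standard way to search in Solr? Am I exposing myself to problems (performance problems, perhaps) by tokenizing the words before calling Solr? Perhaps thinking in terms of single terms versus phrases is wrong, and we should let the user decide?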

Answer 1

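What you have stumbled upon is the fact that the stop word filter prevents stop words from being indexed, but their positions are still recorded. Something like a placeholder is stored in the index at the position where the stop word occurred.

So when you put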

the blue dress

into the index, it is indexed as

* blue dress

The same thing happens when you hand over the phrase

"blue the dress"

as a query. It is treated as

"blue * dress"

Solr then compares the two fragments and finds no match, because the * is in the wrong position.
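The position behaviour described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: it is plain Java, not Lucene or Solr code, it uses a hard-coded stop word list, and it ignores the n-gram and stemming filters from the schema. The class and method names are hypothetical. A removed stop word leaves a hole (`null`) in its slot, and a phrase matches only if every remaining term lines up at the same relative position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordPositions {

    // Assumed stop word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    // Lowercases and splits on whitespace; a stop word is dropped but
    // its position is kept as a null slot.
    public static List<String> analyze(String text) {
        List<String> slots = new ArrayList<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            slots.add(STOP_WORDS.contains(word) ? null : word);
        }
        return slots;
    }

    // True if the phrase terms occur in the document at the same
    // relative positions, holes included.
    public static boolean phraseMatches(String document, String phrase) {
        List<String> doc = analyze(document);
        List<String> query = analyze(phrase);

        // Holes at the edges of the phrase do not constrain anything.
        int first = 0;
        while (first < query.size() && query.get(first) == null) {
            first++;
        }
        int last = query.size() - 1;
        while (last >= first && query.get(last) == null) {
            last--;
        }
        if (first > last) {
            return false; // the phrase consisted of stop words only
        }
        query = query.subList(first, last + 1);

        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean allAligned = true;
            for (int i = 0; i < query.size(); i++) {
                String wanted = query.get(i);
                if (wanted != null && !wanted.equals(doc.get(start + i))) {
                    allAligned = false;
                    break;
                }
            }
            if (allAligned) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("the blue dress", "the blue dress")); // true
        System.out.println(phraseMatches("blue dress", "the blue dress"));     // true
        System.out.println(phraseMatches("blue dress", "blue the dress"));     // false
        System.out.println(phraseMatches("the blue dress", "blue the dress")); // false
    }
}
```

In this model "blue the dress" requires one empty slot between "blue" and "dress", which neither "blue dress" nor "the blue dress" provides.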

Before Solr 4.4, this was usually solved by setting enablePositionIncrements="true" on the StopFilterFactory, as described by Pascal Dimassimo. Apparently a refactoring broke that option on the StopFilterFactory, as discussed on SO and in Solr's Jira.

Update: while reading the reference documentation for the Extended DisMax Query Parser, I came across this:

The stopwords Parameter

A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.

I will check whether this helps solve the problem.

Answer 2

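Although the initial approach might work if the query is split into multiple title:term clauses, it is error-prone (tokens may be split in the wrong place) and it duplicates, probably badly, the work done by the built-in tokenizer.

The right approach is to keep the initial query as is and rely on the Solr configuration to process it correctly. That makes sense, but the difficulty is that I want to specify which fields to search. It turns out this cannot be done with the default query parser, the so-called LuceneQParserPlugin (confusingly, there is a parameter called fl, for field list, which specifies the fields to return, not the fields to search).

For completeness, it must be mentioned that a list of fields to search can be emulated with a copyField configuration in schema.xml. I find this neither elegant nor flexible enough.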

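The elegant solution is to use the ExtendedDisMax query parser, also known as edismax. With it, we can keep the query as is and take full advantage of the configuration in the schema. In our case, it looks like this: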

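        import org.apache.solr.client.solrj.SolrQuery;

        // "query" holds the raw user input, unchanged
        SolrQuery solrQuery = new SolrQuery();
        solrQuery.set("defType", "edismax");          // use the ExtendedDisMax parser
        solrQuery.set("q", query);                    // e.g. "blue the dress"
        solrQuery.set("qf", "description title");     // fields to search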
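For readers not using SolrJ, the same three parameters can be sent as an ordinary HTTP query string. The sketch below only builds that string with the JDK; the class and method names are hypothetical, and the host, core, and request handler path are deliberately left out because they depend on the installation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class EdismaxRequestParams {

    // Returns the URL-encoded parameters equivalent to the SolrJ calls:
    // defType, q, and qf, in that order.
    public static String toQueryString(String userInput) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", userInput);
        params.put("qf", "description title");

        StringJoiner joined = new StringJoiner("&");
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                joined.add(entry.getKey() + "=" + URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // prints defType=edismax&q=blue+the+dress&qf=description+title
        System.out.println(toQueryString("blue the dress"));
    }
}
```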

According to this page:

(e)Dismax generally makes the best first choice query parser for user facing Solr applications

It would help if this were indeed the default choice.


solr

edismax