Dismax Solr query parser working very poorly

I have a very large database of 4.5M documents. With the default query parser, the document I am looking for shows up in the results as it should.

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"\"I predict a riot\"",
      "rows":"1"}},
  "response":{
    "numFound":15,"start":0,"docs":[
      {
        "artist":"Kaiser Chiefs",
        "text":"<p>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>And not very sensible either<br>A friend of a friend he got beaten<br>He looked the wrong way at a policeman<br>Would never have happened to Smeaton<br>An old Leodiensian<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>Oh, I try to get to my taxi<br>A man in a tracksuit attacks me<br>He said that he saw it before me<br>Wants to get things a bit gory<br>Girls scrabble round with no clothes on<br>To borrow a pound for a condom<br>If it wasn't for chip fat, they'd be frozen<br>They're not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>Ow!<br><br>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>Not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot</p>",
        "_ts":6341730138387906561,
        "title":"I predict a riot",
        "id":"redacted"}]
  }}

However, when I switch to the DisMax query handler with all the additional parameters, I get this:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "\"I predict a riot\"",
      "defType": "dismax",
      "ps": "0",
      "qf": "text",
      "echoParams": "all",
      "pf": "text^5",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

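Nothing... If I remove the quotes, it finds some very irrelevant results (songs by an artist named "I"). In case it is not clear, "I predict a riot" is present in the text field of this document. Several times, even.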

I am new to Solr and I don't understand what is wrong with this query. I tried changing qf and pf to "artist text title", but that returned nothing either.

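Ideally, the goal is to find matches across all three fields, with a big boost when all the words are found in the same order in the title, artist, or text. But even this simple test doesn't seem to work. :-/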

Thanks!

Edit: using these parameters

"params": {
"q": "I predict a riot",
"defType": "dismax",
"qf": "text artist title",
"echoParams": "all",
"pf": "text^5",
"rows": "100",
"wt": "json"
}

gives me this debug output:

"debug": {
"rawquerystring": "I predict a riot",
"querystring": "I predict a riot",
"parsedquery": "(+(DisjunctionMaxQuery((text:I | title:I | artist:I)) DisjunctionMaxQuery((text:predict | title:predict | artist:predict)) DisjunctionMaxQuery((text:a | title:a | artist:a)) DisjunctionMaxQuery((text:riot | title:riot | artist:riot))) DisjunctionMaxQuery(((text:I predict a riot)^5.0)))/no_coord",
"parsedquery_toString": "+((text:I | title:I | artist:I) (text:predict | title:predict | artist:predict) (text:a | title:a | artist:a) (text:riot | title:riot | artist:riot)) ((text:I predict a riot)^5.0)",
"QParser": "DisMaxQParser",
"altquerystring": null,
"boostfuncs": null
}

I get bad results, namely an artist named "I", but not the Kaiser Chiefs song, which has the query in its title and several times in its text.

Field definitions:

 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="artist" type="string" indexed="true" stored="true"/>   
 <field name="text" type="string" indexed="true" stored="true"/>

A string field only matches the exact value of the field (meaning that case, whitespace, and so on must all be identical).
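To make the difference concrete, here is a minimal toy model in Python. It is not Solr code: the two functions are hypothetical stand-ins, and the "text field" is approximated by whitespace splitting plus lowercasing, which is far simpler than a real analyzer chain.

```python
# Toy model only: illustrates the two matching styles, not how Solr is implemented.

def string_field_matches(indexed_value: str, query: str) -> bool:
    """A string field keeps the whole value as one token, so only an exact match counts."""
    return indexed_value == query


def text_field_matches(indexed_value: str, query: str) -> bool:
    """Rough stand-in for a tokenized, lowercased text field:
    every query word must be among the indexed tokens."""
    indexed_tokens = set(indexed_value.lower().split())
    return all(word in indexed_tokens for word in query.lower().split())


title = "I predict a riot"

print(string_field_matches(title, "I predict a riot"))  # identical value
print(string_field_matches(title, "i predict a riot"))  # differs only in case
print(string_field_matches(title, "riot"))              # single word from the value
print(text_field_matches(title, "riot"))                # same word against the tokenized model
```

In this sketch only the identical value matches the string-style field, while the tokenized model also matches a single word taken from the value.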

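To get the kind of matching you expect, you need a text field. The text_general / text_en field types in the example schema are probably usable, at least as a starting point, but you may want to tune exactly what the field does depending on how you query it. If you don't have synonyms or don't want to remove stopwords, remove those lines and keep only the tokenizer and the lowercase filter: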

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- The tokenizer and filters belong inside an analyzer element -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

You need to reindex your data after changing the field type.

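"But I do have a field in qf that contains the whole sentence?" Yes. But the dismax query parser tokenizes the input according to its own rules and then builds a new internal query from the result. You can see in the debug output that it expands the query string into a long series of ORs in which each term is searched separately. Since no tokens matching those individual terms were indexed on their own, you get no hits.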
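The effect can be sketched with a small Python model of the parsed query shown in the debug output. This is a deliberate simplification, not the real parser: the helper names are hypothetical, scoring and the pf phrase clause are ignored, and each string field is modeled as matching only when its whole value equals the term. The document for the artist named "I" uses a made-up title and text.

```python
# Toy model of the per-term expansion seen in "parsedquery"; not Solr code.

def dismax_term_clauses(query, fields):
    """One clause per whitespace-separated term, each listing field:term alternatives."""
    return [[(field, term) for field in fields] for term in query.split()]


def clause_matches(clause, doc):
    """A string field holds its entire value as a single token,
    so field:term matches only if the stored value equals the term."""
    return any(doc.get(field) == term for field, term in clause)


def matching_clause_count(query, fields, doc):
    """Number of per-term clauses that match the document."""
    return sum(clause_matches(c, doc) for c in dismax_term_clauses(query, fields))


fields = ["text", "title", "artist"]

kaiser_chiefs_song = {
    "artist": "Kaiser Chiefs",
    "title": "I predict a riot",
    "text": "(full lyrics stored as one long string)",
}
song_by_artist_i = {
    "artist": "I",
    "title": "(hypothetical title)",
    "text": "(hypothetical lyrics)",
}

print(matching_clause_count("I predict a riot", fields, kaiser_chiefs_song))
print(matching_clause_count("I predict a riot", fields, song_by_artist_i))
```

Under these assumptions no clause matches the Kaiser Chiefs document, because none of its field values equals a single term, while the clause for "I" matches the document whose artist value is exactly "I". That is consistent with the results described in the question.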

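If you used the edismax query parser, which also supports Lucene query syntax, you could get at least one hit with title:"I predict a riot", but it still wouldn't behave the way you expect: it would only return the one document whose title matches character for character.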
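As a rough sketch, such a request could be assembled like this in Python. The host, port, and core name ("lyrics") are assumptions made up for illustration; only defType, q, and wt come from the discussion above. The snippet only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/lyrics/select"

params = {
    "defType": "edismax",             # edismax also accepts Lucene field:"phrase" syntax
    "q": 'title:"I predict a riot"',  # exact-value lookup against the string field
    "wt": "json",
}

# urlencode takes care of escaping the quotes, colon, and spaces in q.
url = SOLR_SELECT_URL + "?" + urlencode(params)
print(url)
```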