Matching a document through Elasticsearch's Percolate API always returns no matches if the registered query contains terms

Question

I am trying to use the Percolator with Elasticsearch, but I have run into a small problem.

Suppose our document looks like this:

{
    "doc": {
        "full_name": "Pacman",
        "company": "Arcade Game LTD",
        "occupation": "hunter", 
        "tags": ["Computer Games"]
    }
}

And the query we registered looks like this:

{
    "query": {
        "bool": {
            "must": [
               {
                   "match_phrase":{
                       "occupation":  "hunter"
                   }
               },
               {
                   "terms": {
                       "tags":  [
                           "Computer Games",
                           "Electronic Sports"
                           ],
                       "minimum_match": 1
                   }
               }
            ]
        }
    }
}

I get:

{
   "took": 3,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "total": 0,
   "matches": []
}

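I have no idea what I am doing wrong, because if I remove the terms clause from the registered query and match only on occupation, it works as expected and I get a match.

Any hints?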

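Update 1

OK, I think @Slam's solution points in the right direction, but I still have some problems:

I updated the mapping for tags, and it now looks like this: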

"tags": {
    "store": True,
    "analyzer": "snowball",
    "type": "string",
    "index": "analyzed",
    "fields": {
        "raw": {
           "type": "string",
           "index": "not_analyzed"
       }
    }
}

The new document to percolate:

{
    "doc": {
        "full_name": "Pacman",
        "company": "Arcade Game LTD",
        "occupation": "hunter", 
        "tags.raw": ["Computer Games"]
    }
}

When I try to match the document above against tags.raw, still no match is found. I analyzed the field tags.raw, and it still looks like it creates the tokens computer, games and running.

Answer 1

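I guess you are using an implicit mapping (the default analyzer) or some other kind of analyzer for the tags field. This means the data (in your case "Computer Games") is broken into tokens and is no longer available for a terms search, because in the index it is now represented as something like computer + game.

To be able to do term matching on strings, you need to map them as not analyzed (to prevent them from being sliced into tokens), e.g.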
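To make the effect of analysis concrete, here is a minimal pure-Python sketch. It does not talk to Elasticsearch at all: the `analyze` function is a hypothetical stand-in (lowercase and split on whitespace) for whatever analyzer is really configured, and `index_field` and `terms_query_matches` are illustrative helpers, not library calls. Real analyzers such as snowball also stem tokens, so the actual terms in your index may differ.

```python
def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def index_field(values, analyzed):
    """Return the set of terms a toy index would store for one field."""
    terms = set()
    for value in values:
        if analyzed:
            # Analyzed: the string is broken into separate tokens.
            terms.update(analyze(value))
        else:
            # Not analyzed: the whole string is stored as a single term.
            terms.add(value)
    return terms


def terms_query_matches(indexed_terms, wanted):
    """A terms query compares each wanted term verbatim, without analysis."""
    return any(term in indexed_terms for term in wanted)


tags = ["Computer Games"]
wanted = ["Computer Games", "Electronic Sports"]

analyzed_terms = index_field(tags, analyzed=True)
raw_terms = index_field(tags, analyzed=False)

print(sorted(analyzed_terms))                       # ['computer', 'games']
print(terms_query_matches(analyzed_terms, wanted))  # False
print(terms_query_matches(raw_terms, wanted))       # True
```

The exact string "Computer Games" never appears among the analyzed terms, which is why the terms clause finds nothing while match_phrase on occupation still works.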

PUT so/pacman/_mapping
{
  "pacman": {
    "properties": {
      "tags": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

or make your tags field a multi-field, like

PUT so/pacman/_mapping
{
  "pacman": {
    "properties": {
      "tags": {
        "type": "string",
        "index": "analyzed",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

and query the documents with

GET so/pacman/_search
{
  "query": {
    "terms": {
      "tags.raw": [
        "Computer Games",
        "Running"
      ],
      "minimum_match": 1
    }
  }
}

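This approach lets you do both text search and term search.

Regarding your Update 1: once you have put the correct mapping and percolator query in place, for example: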

PUT so/.percolator/1
{
  "query": {
    "terms": {
      "tags.raw": [
        "Computer Games",
        "Maze running"
      ]
    }
  }
}

you need to index/percolate a document in a format like

GET so/pacman/_percolate
{
  "doc": {
    "full_name": "Pacman",
    "company": "Arcade Game LTD",
    "occupation": "hunter", 
    "tags": ["Computer Games"]
  }
}

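Here is what happens. You index/percolate a document with the field tags (with no mention of raw or any other multi-field you have). ES takes this field from the JSON, adds tags.raw to the index (as a whole string), and at the same time breaks it into analyzed tokens and puts them into the tags field (the real process is far more complicated, but for simplicity let's skip over that here). So you don't need to manage anything internal about this field; you already did that in the mapping.

When the percolator query runs, it looks up the tags.raw field in the index (because you created the terms query on this "sub-field"), and the analyzed field is left untouched.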
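The multi-field behaviour described above can be sketched in plain Python as a toy model. Nothing here calls Elasticsearch: `analyze`, `index_document` and `terms_match` are hypothetical helpers, and the whitespace tokenizer is an assumption standing in for the real analyzer. The point is only that a document sent with a plain tags key ends up with both an analyzed tags entry and a whole-string tags.raw entry.

```python
def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def index_document(doc, multi_fields):
    """Build a toy inverted index: field name -> set of terms.

    Fields named in multi_fields also get a not-analyzed `.raw` sub-field,
    mirroring the multi-field mapping from the answer.
    """
    index = {}
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        # Main field: analyzed tokens.
        index[field] = {token for v in values for token in analyze(v)}
        if field in multi_fields:
            # Sub-field: each original string kept whole.
            index[field + ".raw"] = set(values)
    return index


def terms_match(index, field, wanted):
    """Verbatim term lookup against one field of the toy index."""
    return any(term in index.get(field, set()) for term in wanted)


doc = {
    "full_name": "Pacman",
    "company": "Arcade Game LTD",
    "occupation": "hunter",
    "tags": ["Computer Games"],  # plain `tags`, no `tags.raw` in the document
}
index = index_document(doc, multi_fields={"tags"})
registered = ["Computer Games", "Maze running"]

print(terms_match(index, "tags.raw", registered))  # True
print(terms_match(index, "tags", registered))      # False
```

In this model the sub-field is derived from the mapping, not from the document, which is why the document itself should only carry tags.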

python

elasticsearch

pyelasticsearch