Elasticsearch tokenizer to keep (and concatenate) "and"
I'm trying to build an Elasticsearch filter, analyzer, and tokenizer that can normalize searches such as:
"henry&william book" -> "henrywilliam book"
"henry & william book" -> "henrywilliam book"
"henry and william book" -> "henrywilliam book"
"henry william book" -> "henry william book"
In other words, I want to normalize my "and" and "&" queries while concatenating the words on either side of them.
I was thinking of building a tokenizer that splits "henry & william book" into the tokens ["henry & william", "book"], and then a filter that performs the following replacements:
" & " -> ""
" and " -> ""
"&" -> ""
However, this feels hacky. Is there a better way?
The reason I can't do this entirely in the analyzer/filter stage is that it runs too late. In my attempts, Elasticsearch had already split "henry & william" into ["henry", "william"] before my analyzer/filter ever ran.
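For illustration, the premature splitting can be reproduced with the built-in standard analyzer (a minimal sketch, not taken from the original attempts):
POST _analyze
{
  "analyzer": "standard",
  "text": ["henry & william"]
}
This returns the two tokens henry and william; the & is already gone by the time any token filter runs.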
You can cleverly combine two character filters that kick in before the tokenizer. The first character filter maps and to &, and the second one removes the & and glues the two adjacent tokens together. This combination also allows you to introduce other substitutions, such as | and or.
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": "$1$3"
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
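If you also want to handle or and |, the mapping list of the first character filter can be extended along these lines (a sketch only; the extra mappings are assumptions, and note that the mapping character filter matches these strings anywhere in the character stream, including inside other words, which the second answer's \b boundaries guard against):
"and": {
  "type": "mapping",
  "mappings": [
    "and => &",
    "or => &",
    "| => &"
  ]
}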
This produces the following results:
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
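To use this analyzer on real documents, it would be attached to a field in the index mapping, for example (a minimal sketch; the field name title is an assumption):
PUT test/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}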
You only need a single character filter and a bit of regex knowledge. Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_and": {
          "type": "pattern_replace",
          "pattern": """\s*(&|\band\b)\s*""",
          "replacement": "",
          "description": "Removes ands and ampersands"
        }
      },
      "analyzer": {
        "book-analyzer": {
          "type": "custom",
          "char_filter": [
            "remove_and"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
Explanation:
- \s* optional whitespace around the expression
- \b word boundaries around 'and', so that it does not trigger inside words like candy
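A quick way to verify the behavior, assuming the settings above are applied to an index named test2 (the index name is an assumption):
POST test2/_analyze
{
  "analyzer": "book-analyzer",
  "text": ["henry and william book"]
}
This should return the single keyword token henrywilliam book, while a word like candy remains untouched thanks to the \b boundaries.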