Elasticsearch tokenizer to keep (and concatenate) "and"

I'm trying to build an Elasticsearch filter, analyzer, and tokenizer to be able to normalize searches such as "henry & william book", "henry and william book", and "henry&william book" so that they all match the same way.

In other words, I'd like to normalize my "and" and "&" queries, and concatenate the words on either side of them.

I was thinking of writing a tokenizer that splits "henry & william book" into the tokens ["henry & william", "book"], and then a character filter that rewrites those tokens, e.g. collapsing "henry & william" into a single concatenated term.

However, this feels a bit hacky. Is there a better way?

The reason I can't do this entirely in the analyzer/filter stage is that it runs too late: in my attempts, Elasticsearch had already split "henry & william" into ["henry", "william"] before my analyzer/filter got to run.
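For reference, this default behavior is easy to reproduce with the _analyze API, assuming the standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": ["henry & william book"]
}

This returns the three tokens henry, william, and book: the & is already gone by the time any token filter gets to run.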

You can use a clever mix of two character filters that kick in before the tokenizer. The first character filter maps and to &, and the second one gets rid of the & and glues the two neighboring tokens together. This combination also lets you introduce other substitutions, such as "|"/"or".

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": ""
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

This produces the following results:

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
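To actually search with this, the analyzer can be attached to a field in the mapping. A minimal sketch, assuming a hypothetical text field named title on the test index created above:

PUT test/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}

Note that with the keyword tokenizer the whole field value becomes a single token, so a match query on title will only hit exact (normalized) values.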

You only need a single character filter and a bit of regex knowledge. A character filter is used to preprocess the stream of characters before it is passed to the tokenizer.

{
    "settings": {
        "analysis": {
            "char_filter": {
                "remove_and": {
                    "type": "pattern_replace",
                    "pattern": """\s*(&|\band\b)\s*""",
                    "description": "Removes ands and ampersands"
                }
            },
            "analyzer": {
                "book-analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "remove_and"
                    ],
                    "tokenizer": "keyword"
                }
            }
        }
    }
}

Explanation:

  • \s* matches optional whitespace around the expression
  • \b puts word boundaries around 'and' so that it doesn't, for example, match inside candy
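You can verify this char filter without creating an index at all, since the _analyze API also accepts inline analysis components. A quick sketch, runnable in Kibana Dev Tools like the snippets above:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": """\s*(&|\band\b)\s*""",
      "replacement": ""
    }
  ],
  "text": ["henry and william book"]
}

This yields the single token "henrywilliam book", matching the output of the first approach.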