Line breaks or punctuation marks as position gaps in elasticsearch

Question

In elasticsearch, is there a way to set up an analyzer that produces position gaps between tokens whenever a line break or punctuation mark is encountered?

Let's say I index an object with the following nonsensical string (containing a line break) as one of its fields:

The quick brown fox runs after the rabbit.
Then comes the jumpy frog.

The standard analyzer will yield the following tokens with their respective positions:

0 the
1 quick
2 brown
3 fox
4 runs
5 after
6 the
7 rabbit
8 then
9 comes
10 the
11 jumpy
12 frog

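This means that a match_phrase query for "the rabbit then comes" will match this document as a hit. Is there a way to introduce a position gap between rabbit and then, so that the query doesn't match unless a slop is specified?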
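To make the problem concrete, here is a minimal Python sketch of the behaviour described above. It is not Elasticsearch code: `analyze` is a rough stand-in for the standard analyzer (lowercased word tokens, 0-based positions), and `phrase_matches` is a simplified, in-order approximation of phrase matching with slop. Both function names are hypothetical.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercased word tokens, 0-based positions."""
    return list(enumerate(re.findall(r"\w+", text.lower())))

def phrase_matches(doc_tokens, phrase, slop=0):
    """Simplified in-order phrase match: the positions skipped between
    consecutive query terms must add up to at most `slop`."""
    terms = [tok for _, tok in analyze(phrase)]
    positions = {}
    for pos, tok in doc_tokens:
        positions.setdefault(tok, []).append(pos)

    def search(i, prev, budget):
        if i == len(terms):
            return True
        for pos in positions.get(terms[i], []):
            if prev is None:
                gap = 0                  # first term may start anywhere
            elif pos > prev:
                gap = pos - prev - 1     # positions skipped since the previous term
            else:
                continue                 # this sketch only allows in-order matches
            if gap <= budget and search(i + 1, pos, budget - gap):
                return True
        return False

    return search(0, None, slop)

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(doc)
print(tokens[7], tokens[8])                             # (7, 'rabbit') (8, 'then')
print(phrase_matches(tokens, "the rabbit then comes"))  # True
```

Because rabbit and then sit at adjacent positions, the phrase matches with no slop even though the two words belong to different sentences.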

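Of course, a workaround could be to transform the single string into an array (one line per entry) and use position_offset_gap in the field mapping, but I would really rather keep a single string with newlines (and the ultimate solution would produce larger position gaps for newlines than for punctuation marks).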
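For reference, the array workaround mentioned above might look roughly like the following sketch, written as plain Python dictionaries that would be sent as request bodies. The field name `body` and the gap value of 100 are assumptions for illustration only. The mapping uses the older `string` type and `position_offset_gap` name used in this question; later Elasticsearch versions renamed the parameter to `position_increment_gap`.

```python
text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."

# Hypothetical field name "body"; the gap value 100 is an arbitrary choice.
mapping = {
    "properties": {
        "body": {
            "type": "string",
            "position_offset_gap": 100,  # gap inserted between array entries
        }
    }
}

# One array entry per line instead of a single multi-line string.
document = {"body": text.splitlines()}
print(document)
```

The drawback is the one already stated: the field no longer holds a single string, and punctuation inside a line still produces no gap.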

Answer 1

I eventually figured out a solution that uses a char_filter to introduce extra tokens on line breaks and punctuation marks:

PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Testing with the example string:

POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.

This yields the following result:

{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
... snip ...
    "token" : "rabbit",
    "start_offset" : 35,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "_period_",
    "start_offset" : 41,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "_newline_",
    "start_offset" : 42,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "then",
    "start_offset" : 43,
    "end_offset" : 47,
    "type" : "<ALPHANUM>",
    "position" : 11
... snip ...
  } ]
}
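The effect of the two mapping rules on positions can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions: the hypothetical helper `analyze_with_markers` applies the replacements in a single pass and then splits on word characters, which is a rough stand-in for the standard tokenizer plus the lowercase filter. Positions here are 0-based, whereas the output above starts at 1, but the distance between tokens is the same.

```python
import re

# The same two rules as the "my_mapping" char_filter above.
MAPPINGS = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}

def analyze_with_markers(text):
    """Apply the mappings in one pass, then emit lowercased word tokens
    with 0-based positions."""
    filtered = re.sub(r"[.\n]", lambda m: MAPPINGS[m.group()], text)
    return list(enumerate(re.findall(r"\w+", filtered.lower())))

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze_with_markers(doc)

# "rabbit" and "then" each occur once, so a simple lookup is enough here.
position = {tok: pos for pos, tok in tokens}

# Number of marker tokens sitting between the two words.
required_slop = position["then"] - position["rabbit"] - 1
print(tokens[7:11])
print(required_slop)  # 2
```

With the marker tokens in place, a match_phrase query for "the rabbit then comes" would need a slop of at least 2 to bridge the `_period_` and `_newline_` tokens. A period inside a line adds one position and a period followed by a newline adds two, which gives the larger gap for newlines that the question asked for.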
