Elasticsearch token position change

I recently became interested in Elasticsearch analyzers. I have learned what a token graph is, and what start_offset, end_offset, position and positionLength mean.

Index schema

PUT synonym_graph_index
{
"settings": {
  "number_of_replicas": 0,
  "analysis": {
    "analyzer": {
      "synonym_graph_analyzer":{
        "type":"custom",
        "tokenizer":"standard",
        "filter":["synonym_filter"]
      }
    },
    "filter": {
      "synonym_filter":
      {
        "type":"synonym_graph",
        "synonyms":["wi fi => wifi,hotspot,fast network"]
      }
    }
  }
},
"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "synonym_graph_analyzer"
    }
  }
}
}

I added a document to it and checked how its text is analyzed:

POST synonym_graph_index/_analyze
{
  "analyzer": "synonym_graph_analyzer",
  "text": "Airtel wi fi is up and down"
}

Analysis result

{
  "tokens" : [
    {
      "token" : "Airtel",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wifi",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "hotspot",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "fast",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "network",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "up",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "and",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "down",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}

To understand this better, I made a table.

Using the table above, I also drew a graph.
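A table of this kind can be rebuilt directly from the _analyze response. The following is a minimal Python sketch, not the original table: the `tokens` list is copied from the response above (offsets and type left out), while the helper name `token_rows` and the column layout are assumptions made for illustration. A missing positionLength is treated as 1.

```python
# Tokens copied from the _analyze response above (offsets and type omitted).
tokens = [
    {"token": "Airtel", "position": 0},
    {"token": "wifi", "position": 1, "positionLength": 2},
    {"token": "hotspot", "position": 1, "positionLength": 2},
    {"token": "fast", "position": 1},
    {"token": "network", "position": 2},
    {"token": "is", "position": 3},
    {"token": "up", "position": 4},
    {"token": "and", "position": 5},
    {"token": "down", "position": 6},
]


def token_rows(tokens):
    """Return (token, from_position, position_length, to_position) tuples.

    token_rows is a hypothetical helper, not part of any Elasticsearch client.
    """
    rows = []
    for t in tokens:
        # Assumption: a token without positionLength spans exactly one position.
        length = t.get("positionLength", 1)
        rows.append((t["token"], t["position"], length, t["position"] + length))
    return rows


if __name__ == "__main__":
    print(f"{'token':<8} {'from':>4} {'len':>3} {'to':>3}")
    for token, start, length, end in token_rows(tokens):
        print(f"{token:<8} {start:>4} {length:>3} {end:>3}")
```

Reading each row as an edge from the "from" position to the "to" position gives the token graph: "wifi" and "hotspot" both run from 1 to 3, while "fast" runs from 1 to 2 and "network" from 2 to 3.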

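The network token changed its position. Did that happen because I used the standard tokenizer and it split "fast network"? One more thing I would like to know: why is positionLength not mentioned for some tokens?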

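  1. Yes. Synonyms simply "replace" the input with the output; they do not affect how it is processed further down the line (tokenization, stemming, etc.). A multi-word output such as "fast network" is therefore still split into two tokens, and "network" gets its own position.
  2. Your original string "wi fi" has 2 tokens, but some of the synonyms ("wifi", "hotspot") are single words, so they carry a positionLength of 2 to indicate that the token spans 2 positions. When positionLength is 1, which is the default, the _analyze response omits it; that is why "fast" and "network" show none.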
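One way to check both points is to treat positions as nodes and tokens as edges, then list every path through the graph. The sketch below is an illustration under stated assumptions, not Elasticsearch code: the `(term, position, positionLength)` triples are copied from the _analyze response in the question, with a missing positionLength written as 1, and `enumerate_paths` is a hypothetical helper name.

```python
from collections import defaultdict

# (term, position, positionLength) copied from the _analyze response;
# positionLength is written as 1 where the response omits it.
tokens = [
    ("Airtel", 0, 1),
    ("wifi", 1, 2),
    ("hotspot", 1, 2),
    ("fast", 1, 1),
    ("network", 2, 1),
    ("is", 3, 1),
    ("up", 4, 1),
    ("and", 5, 1),
    ("down", 6, 1),
]


def enumerate_paths(tokens):
    """List every term sequence that walks the token graph from start to end."""
    edges = defaultdict(list)
    for term, position, length in tokens:
        # Each token is an edge from its position to position + positionLength.
        edges[position].append((term, position + length))
    start = min(position for _, position, _ in tokens)
    end = max(position + length for _, position, length in tokens)

    def walk(node):
        if node == end:
            return [[]]
        found = []
        for term, nxt in edges[node]:
            for rest in walk(nxt):
                found.append([term] + rest)
        return found

    return walk(start)


if __name__ == "__main__":
    for path in enumerate_paths(tokens):
        print(" ".join(path))
```

The three paths differ only between positions 1 and 3: "wifi" and "hotspot" each cover that gap with a single edge of length 2, while "fast" followed by "network" covers it with two edges of length 1. That is why the single-word synonyms need positionLength 2 and "network" sits at position 2.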