Create keyword string type with custom analyzer in 5.3.0

Question

I have a string that I'd like to index as a keyword type, but with a special comma analyzer. For example:

"San Francisco, Boston, New York" -> "San Francisco", "Boston", "New York"

It should be both indexed and aggregatable, so that I can split it up by buckets. Prior to 5.0.0, the following worked. Index settings:

{
     'settings': {
         'analysis': {
             'tokenizer': {
                 'comma': {
                     'type': 'pattern',
                     'pattern': ','
                 }
             },
             'analyzer': {
                 'comma': {
                     'type': 'custom',
                     'tokenizer': 'comma'
                 }
             }
         },
     },
}

With the following mapping:

{
    'city': {
        'type': 'string',
        'analyzer': 'comma'
    },
}

Now, in 5.3.0 and above, analyzer is no longer a valid property for the keyword type, and my understanding is that I want a keyword type here. How do I specify an aggregatable, indexed, searchable text type with a custom analyzer?

Answer 1

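You can use multi-fields to index the same field in two different ways, one for searching and the other for aggregations.

I would also suggest adding filters that trim and lowercase the generated tokens, which makes searching more forgiving.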
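
To make the effect of such an analyzer concrete (a comma pattern tokenizer followed by lowercase and trim filters), here is a minimal Python sketch that approximates it outside Elasticsearch. The function name `comma_analyze` is hypothetical, and the code only imitates the analysis chain; it does not call Elasticsearch and may differ from it in edge cases such as empty tokens.

```python
import re


def comma_analyze(text):
    """Approximate a pattern tokenizer on "," followed by lowercase and trim filters."""
    # Pattern tokenizer: split on the separator; this sketch skips empty pieces.
    tokens = [t for t in re.split(r",", text) if t]
    # Lowercase filter.
    tokens = [t.lower() for t in tokens]
    # Trim filter: remove leading and trailing whitespace from each token.
    tokens = [t.strip() for t in tokens]
    return tokens


print(comma_analyze("San Francisco, Boston, New York"))
# ['san francisco', 'boston', 'new york']
```

Because a term query does not analyze its input, the value you search for has to match the stored token exactly, that is, the lowercased and trimmed form such as "new york".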

Mappings

PUT commaindex2
    {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "comma": {
                        "type": "pattern",
                        "pattern": ","
                    }
                },
                "analyzer": {
                    "comma": {
                        "type": "custom",
                        "tokenizer": "comma",
                        "filter": ["lowercase", "trim"]
                    }
                }
            }
        },
        "mappings": {
            "city_document": {
                "properties": {
                    "city": {
                        "type": "keyword",
                        "fields": {
                            "city_custom_analyzed": {
                                "type": "text",
                                "analyzer": "comma",
                                "fielddata": true
                            }
                        }
                    }
                }
            }
        }
    }

Index document

POST commaindex2/city_document
{
  "city" : "san fransisco, new york, london"
}

Search query

POST commaindex2/city_document/_search
{
    "query": {
        "bool": {
            "must": [{
                "term": {
                    "city.city_custom_analyzed": {
                        "value": "new york"
                    }
                }
            }]
        }
    },
    "aggs": {
        "terms_agg": {
            "terms": {
                "field": "city",
                "size": 10
            }
        }
    }
}

Note

If you want to run the aggregation on the analyzed field, for example to count each city in its own bucket, run it on the city.city_custom_analyzed field instead:

POST commaindex2/city_document/_search
{
    "query": {
        "bool": {
            "must": [{
                "term": {
                    "city.city_custom_analyzed": {
                        "value": "new york"
                    }
                }
            }]
        }
    },
    "aggs": {
        "terms_agg": {
            "terms": {
                "field": "city.city_custom_analyzed",
                "size": 10
            }
        }
    }
}
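
The difference between aggregating on `city` and on `city.city_custom_analyzed` can be sketched in plain Python. This is only an illustration of how terms buckets count documents per term, not Elasticsearch code; the helpers `terms_buckets` and `comma_terms` and the sample documents are hypothetical.

```python
from collections import Counter

# Hypothetical sample values of the "city" field in two documents.
docs = [
    "san francisco, new york, london",
    "new york, boston",
]


def terms_buckets(docs, analyze):
    """Count, per term, how many documents contain it (similar to doc_count)."""
    counts = Counter()
    for text in docs:
        # A document contributes at most once to each term's bucket.
        counts.update(set(analyze(text)))
    return dict(counts)


def comma_terms(text):
    """Rough stand-in for the comma analyzer: split, trim, lowercase."""
    return [t.strip().lower() for t in text.split(",") if t.strip()]


# keyword field: the whole string is a single term, so one bucket per distinct string.
keyword_buckets = terms_buckets(docs, lambda text: [text])

# Analyzed sub-field: one term per city, so one bucket per city.
analyzed_buckets = terms_buckets(docs, comma_terms)

print(keyword_buckets)
print(analyzed_buckets)
```

In this sketch the keyword-style buckets contain each full string once, while the analyzed buckets count "new york" in both documents.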

Hope this helps.

Answer 2

Since you're using ES 5.3, I suggest a different approach: use an ingest pipeline to split your field at indexing time.

PUT _ingest/pipeline/city-splitter
{
  "description": "City splitter",
  "processors": [
    {
      "split": {
        "field": "city",
        "separator": ","
      }
    },
    {
      "foreach": {
        "field": "city",
        "processor": {
          "trim": {
            "field": "_ingest._value"
          }
        }
      }
    }
  ]
}

Then you can index a new document:

PUT cities/city/1?pipeline=city-splitter
{ "city" : "San Francisco, Boston, New York" }
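
As a rough picture of what the pipeline does to the document before it is indexed, the following Python sketch mimics the two processors: split on the separator, then trim every element. The function name `city_splitter` is hypothetical; it imitates the pipeline locally and does not call the ingest API.

```python
import re


def city_splitter(doc):
    """Mimic the split processor followed by foreach + trim on the "city" field."""
    result = dict(doc)  # copy so the input document is left untouched
    # split processor: the separator is treated as a regular expression.
    parts = re.split(",", result["city"])
    # foreach + trim: strip surrounding whitespace from each array element.
    result["city"] = [p.strip() for p in parts]
    return result


print(city_splitter({"city": "San Francisco, Boston, New York"}))
# {'city': ['San Francisco', 'Boston', 'New York']}
```

The field ends up as an array of strings. With the default dynamic mapping in 5.x, a string field is mapped as text with a keyword sub-field, which is why city.keyword is available for aggregations without defining a mapping.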

Finally, you can search and sort on city and run aggregations on the field city.keyword, as if the cities had been split in your client application:

POST cities/_search
{
  "query": {
     "match": {
         "city": "boston"
     }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      }
    }
  }
}

elasticsearch

elasticsearch-5