Elasticsearch custom analyzer being ignored
I'm using Elasticsearch 2.2.0 and I'm trying to apply the lowercase + asciifolding filters to a field.
This is the output of http://localhost:9200/myindex/:
{
  "myindex": {
    "aliases": {},
    "mappings": {
      "products": {
        "properties": {
          "fold": {
            "analyzer": "folding",
            "type": "string"
          }
        }
      }
    },
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "folding": {
              "token_filters": [
                "lowercase",
                "asciifolding"
              ],
              "tokenizer": "standard",
              "type": "custom"
            }
          }
        },
        "creation_date": "1456180612715",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "uuid": "vBMZEasPSAyucXICur3GVA",
        "version": {
          "created": "2020099"
        }
      }
    },
    "warmers": {}
  }
}
When I try to test the folding custom analyzer with the _analyze API, this is the output I get from http://localhost:9200/myindex/_analyze?analyzer=folding&text=%C3%89sta%20est%C3%A1%20loca:
{
  "tokens": [
    {
      "end_offset": 4,
      "position": 0,
      "start_offset": 0,
      "token": "Ésta",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 9,
      "position": 1,
      "start_offset": 5,
      "token": "está",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 14,
      "position": 2,
      "start_offset": 10,
      "token": "loca",
      "type": "<ALPHANUM>"
    }
  ]
}
As you can see, the tokens returned are Ésta, está, loca rather than esta, esta, loca. What's going on? It's as if the folding analyzer is being ignored.
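As a quick sanity check, you can also run the tokenizer and filters directly through the _analyze API, independent of any index settings, to see what the expected output should be (this uses the URL-style `tokenizer`/`filters` parameters from the 2.x _analyze API; the parameter names may differ in other versions):

```
GET /_analyze?tokenizer=standard&filters=lowercase,asciifolding&text=Ésta está loca
```

If this returns esta, esta, loca while the index-based call does not, the problem is in how the analyzer is registered on the index rather than in the filters themselves.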
This looks like a simple typo made when the index was created.
In your "analysis":{"analyzer":{...}} block, this:
"token_filters": [...]
should be
"filter": [...]
Check the documentation to confirm this. Because your filter array isn't named correctly, ES ignores it entirely and simply falls back to the standard analyzer. Here is a small example written with the Sense Chrome plugin. Execute the requests in order:
DELETE /test
PUT /test
{
  "analysis": {
    "analyzer": {
      "folding": {
        "type": "custom",
        "filter": [
          "lowercase",
          "asciifolding"
        ],
        "tokenizer": "standard"
      }
    }
  }
}
GET /test/_analyze
{
  "analyzer": "folding",
  "text": "Ésta está loca"
}
And the result of the final GET /test/_analyze:
"tokens": [
{
"token": "esta",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "esta",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "loca",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
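Note that analyzer definitions on an existing index generally can't be changed while the index is open, so the rename has to be applied by recreating the index (and reindexing its documents). A sketch of the corrected creation request for the original myindex, combining the settings and mapping from the question with the `filter` fix, might look like this:

```
DELETE /myindex
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "products": {
      "properties": {
        "fold": {
          "type": "string",
          "analyzer": "folding"
        }
      }
    }
  }
}
```

After recreating the index this way, the original _analyze call with analyzer=folding should return the lowercased, folded tokens.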