Elastic synonym usage in aggregations
Situation:
Elastic version used: 2.3.1
I have an Elastic index configured like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
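This works great: when I query the documents with the search term "english" or "queen", I get all documents matching british and monarch. But when I use a synonym term in a filter aggregation, it doesn't work. For example:
My index contains 5 documents: 3 of them have monarch and 2 of them have queen.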
POST /my_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "status.synonym": {
        "query": "queen",
        "operator": "and"
      }
    }
  },
  "aggs": {
    "status_terms": {
      "terms": { "field": "status.synonym" }
    },
    "monarch_filter": {
      "filter": { "term": { "status.synonym": "monarch" } }
    }
  },
  "explain": 0
}
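The result is:
- Total hits: 5 documents (as expected, great!)
- Status terms: 5 documents for queen (as expected, great!)
- Monarch filter: 0 documents
I have tried different synonym filter configurations:
- queen,monarch
- queen,monarch => queen
- queen,monarch => queen,monarch
None of these changed the result. I was about to conclude that synonyms can perhaps only be used at query time, but if the terms aggregation works, why shouldn't the filter? So I think my synonym filter configuration is wrong. A more extensive synonym filter example can be found here.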
Question:
How do I use/configure synonyms in a filter aggregation?
Example to reproduce the case above:
1. Create and configure the index:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "wlh,wellhead=>wellwell"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

PUT my_index/_mapping/job
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_synonyms"
    }
  }
}
2. Put two documents:
PUT my_index/job/1
{
  "title": "wellhead smth else"
}

PUT my_index/job/2
{
  "title": "wlh other stuff"
}
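3. Execute a search on wlh, which should return 2 documents. It includes a terms aggregation, which should report 2 documents for wellwell, and a filter, which should not have a count of 0: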
POST my_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "title": {
        "query": "wlh",
        "operator": "and"
      }
    }
  },
  "aggs": {
    "wlhAggs": {
      "terms": { "field": "title" }
    },
    "wlhFilter": {
      "filter": { "term": { "title": "wlh" } }
    }
  },
  "explain": 0
}
The result of this query is:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "wlhAggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "wellwell",
          "doc_count": 2
        },
        {
          "key": "else",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "smth",
          "doc_count": 1
        },
        {
          "key": "stuff",
          "doc_count": 1
        }
      ]
    },
    "wlhFilter": {
      "doc_count": 0
    }
  }
}
And that is my problem: wlhFilter should have a doc count of at least 1.
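I'm short on time, so I can elaborate later today or tomorrow if needed, but the following should work. (Note that the mapping uses the text type with fielddata enabled, which is Elasticsearch 5.x syntax; on 2.3.1 the equivalent field type is string, which can be aggregated on without that flag.)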
DELETE /my_index

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_synonyms",
          "fielddata": true
        }
      }
    }
  }
}

POST my_index/test/1
{
  "title": "the british monarch"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "queen"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "queen"
    }
  },
  "aggs": {
    "queen_filter": {
      "filter": {
        "term": {
          "title": "queen"
        }
      }
    },
    "monarch_filter": {
      "filter": {
        "term": {
          "title": "monarch"
        }
      }
    }
  }
}
Could you share the mapping you defined for the status.synonym field?
Edit: V2
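The filter returns 0 because a term filter never goes through an analysis phase; it is meant for exact matching.
The token 'wlh' in the aggregation is therefore not translated to 'wellwell', and 'wlh' itself does not appear in the inverted index, because at index time your 'wlh' was replaced by 'wellwell'.
To achieve what you want, you have to index the data into a separate field and adjust your filter accordingly.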
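To make the mechanism concrete, here is a minimal sketch in plain Python. It is a toy model, not Elasticsearch code: the helper names analyze, build_index and term_lookup are made up for illustration, and the whitespace split only approximates the standard tokenizer. It shows that once the synonym rule has run at index time, the inverted index holds wellwell, so an unanalyzed lookup of wlh finds nothing:

```python
# Toy model of the index from the question; not Elasticsearch code.
# Mirrors the contracting rule "wlh,wellhead=>wellwell".
SYNONYMS = {"wlh": "wellwell", "wellhead": "wellwell"}


def analyze(text):
    """Roughly mimic my_synonyms: split on whitespace, lowercase, apply the rule."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Map each indexed term to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_lookup(index, value):
    """Like a term filter: the value is looked up verbatim, without analysis."""
    return index.get(value, set())


docs = {1: "wellhead smth else", 2: "wlh other stuff"}
index = build_index(docs)

print(sorted(index))                        # ['else', 'other', 'smth', 'stuff', 'wellwell']
print(len(term_lookup(index, "wlh")))       # 0, because 'wlh' was never indexed
print(len(term_lookup(index, "wellwell")))  # 2
```

The terms printed first are the same keys that the terms aggregation returned in the question, which is why that aggregation looked correct while the filter stayed at 0.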
You could try something like this:
DELETE my_index

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "wlh,wellhead=>wellwell"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "synonym": {
              "type": "string",
              "analyzer": "my_synonyms"
            }
          }
        }
      }
    }
  }
}

PUT my_index/job/1
{
  "title": "wellhead smth else"
}

PUT my_index/job/2
{
  "title": "wlh other stuff"
}

POST my_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "title.synonym": {
        "query": "wlh",
        "operator": "and"
      }
    }
  },
  "aggs": {
    "wlhAggs": {
      "terms": {
        "field": "title.synonym"
      }
    },
    "wlhFilter": {
      "filter": {
        "term": {
          "title": "wlh"
        }
      }
    }
  }
}
Output:
{
  "aggregations": {
    "wlhAggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "wellwell",
          "doc_count": 2
        },
        {
          "key": "else",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "smth",
          "doc_count": 1
        },
        {
          "key": "stuff",
          "doc_count": 1
        }
      ]
    },
    "wlhFilter": {
      "doc_count": 1
    }
  }
}
Hope this helps!
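So, with the help of @Byron Voorbach below and his comments, this is my solution:
- I created a separate field on which I use the synonym analyzer, as opposed to having a property field (mainfield.property).
- The most important problem was that my synonyms were contracted! I had, for example, british,english => uk. Changing that to british,english,uk solved my problem, and the filter aggregation now returns the right number of documents.
Hope this helps someone, or at least points in the right direction.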
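To illustrate why the contracted rule broke the term filter, here is a rough toy model in plain Python (not Elasticsearch code). It assumes single-word synonyms only and assumes that a comma-separated rule without => expands every listed word to all of them; parse_rule and index_terms are hypothetical helper names:

```python
def parse_rule(rule):
    """Parse one Solr-style synonym rule into {token: [replacement tokens]}.

    Simplification: only single-word synonyms are handled.
    """
    if "=>" in rule:
        left, right = rule.split("=>")
        sources = [word.strip() for word in left.split(",")]
        targets = [word.strip() for word in right.split(",")]
    else:
        # No arrow: every listed word expands to all listed words.
        sources = targets = [word.strip() for word in rule.split(",")]
    return {source: targets for source in sources}


def index_terms(text, mapping):
    """Return the set of terms that would end up in the index for this text."""
    terms = set()
    for token in text.lower().split():
        terms.update(mapping.get(token, [token]))
    return terms


contracted = parse_rule("british,english => uk")
expanded = parse_rule("british,english,uk")

text = "the british monarch"
contracted_terms = index_terms(text, contracted)
expanded_terms = index_terms(text, expanded)

print("british" in contracted_terms)  # False: only 'uk' was indexed
print("british" in expanded_terms)    # True: british, english and uk were all indexed
```

With the contracted rule, a term filter on british has nothing to match; with the expanded rule, every variant is present in the index, so the unanalyzed filter value is found.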
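Edit:
Praise the documentation! I completely solved my problem with the filters (plural!) aggregation (link here). In the filters configuration I specified match as the query type, and it worked! It ended up like this:
"aggs": {
  "messages": {
    "filters": {
      "filters": {
        "status": { "match": { "cats.saurus": "monarch" } },
        "country": { "match": { "cats.saurus": "british" } }
      }
    }
  }
}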
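A rough way to see why match succeeds where term did not is the toy model below, again plain Python rather than Elasticsearch code. It assumes a contracting rule queen,monarch => queen and three hypothetical documents; term_clause, match_clause and filters_agg are illustrative names, not Elasticsearch APIs. The point is that match runs the filter text through the same analyzer as the indexed field before comparing terms:

```python
# Toy model, not Elasticsearch code. Contracting rule: "queen,monarch => queen".
SYNONYMS = {"queen": "queen", "monarch": "queen"}


def analyze(text):
    """Lowercase, split on whitespace, and apply the synonym rule."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


# Hypothetical documents, indexed through the analyzer.
docs = {1: "the british monarch", 2: "the english queen", 3: "a french king"}
indexed = {doc_id: set(analyze(text)) for doc_id, text in docs.items()}


def term_clause(value):
    """term: compare the raw value against the indexed terms."""
    return lambda terms: value in terms


def match_clause(text):
    """match: analyze the text first, then accept any document containing a resulting term."""
    wanted = analyze(text)
    return lambda terms: any(term in terms for term in wanted)


def filters_agg(named_clauses):
    """Count, per named bucket, the documents accepted by that bucket's clause."""
    return {
        name: sum(1 for terms in indexed.values() if clause(terms))
        for name, clause in named_clauses.items()
    }


term_buckets = filters_agg({"status": term_clause("monarch")})
match_buckets = filters_agg({"status": match_clause("monarch")})

print(term_buckets)   # {'status': 0}
print(match_buckets)  # {'status': 2}
```

Because the match clause is analyzed, it behaves the same whether the synonym rule contracts or expands, which is one reason it is the more forgiving choice inside a filters aggregation.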