Elasticsearch token position change
I recently got interested in Elasticsearch analyzers. I have been learning what the token graph, start_offset, end_offset, position, and positionLength are.
Index schema
PUT synonym_graph_index
{
  "settings": {
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "synonym_graph_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "synonyms": ["wi fi => wifi,hotspot,fast network"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "synonym_graph_analyzer"
      }
    }
  }
}
I then ran a sample sentence through the analyzer.
POST synonym_graph_index/_analyze
{
  "analyzer": "synonym_graph_analyzer",
  "text": "Airtel wi fi is up and down"
}
Analysis result
{
  "tokens" : [
    {
      "token" : "Airtel",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wifi",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "hotspot",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "fast",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "network",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "up",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "and",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "down",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}
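To understand the output better, I made a table.
Using that table, I also drew a graph.
The "network" token changed its position. Did that happen because I used the standard tokenizer and it split "fast network"? One more thing I would like to know: why is positionLength not reported for some of the tokens?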
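- Yes. Synonyms just "replace" the input with the output; they do not affect the rest of the analysis chain (tokenization, stemming, and so on). The replacement "fast network" is tokenized like any other text, so it becomes two tokens: "fast" at position 1 and "network" at position 2.
- Your original string "wi fi" had 2 tokens, but some of the synonyms (such as "hotspot") are single words, so they carry positionLength 2 to indicate that the token spans 2 positions. A token whose positionLength is the default of 1 is shown without the field.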
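
As a rough illustration of how position and positionLength together describe a token graph, here is a minimal Python sketch. It is not Elasticsearch code: the token list is copied by hand from the analysis result above (with positionLength set to 1 wherever the output omits it), and the helper name token_graph_paths is made up for this example. Each token is treated as an edge from its position to position + positionLength, and the sketch lists every path from the first position to the last.

```python
from collections import defaultdict

# (token, position, positionLength), copied from the _analyze output above.
# positionLength is assumed to be 1 wherever the output omits the field.
TOKENS = [
    ("Airtel", 0, 1),
    ("wifi", 1, 2),
    ("hotspot", 1, 2),
    ("fast", 1, 1),
    ("network", 2, 1),
    ("is", 3, 1),
    ("up", 4, 1),
    ("and", 5, 1),
    ("down", 6, 1),
]


def token_graph_paths(tokens):
    """Return every token sequence that walks the graph from start to end.

    Hypothetical helper for illustration: a token is an edge that leaves
    node `position` and arrives at node `position + positionLength`.
    """
    edges = defaultdict(list)
    for term, position, length in tokens:
        edges[position].append((term, position + length))

    start = min(position for _, position, _ in tokens)
    end = max(position + length for _, position, length in tokens)

    def walk(node):
        # Reaching the final node completes one path.
        if node == end:
            return [[]]
        paths = []
        for term, next_node in edges[node]:
            for rest in walk(next_node):
                paths.append([term] + rest)
        return paths

    return [" ".join(path) for path in walk(start)]


if __name__ == "__main__":
    for path in token_graph_paths(TOKENS):
        print(path)
```

Run on the tokens above, this prints three paths: "Airtel wifi is up and down", "Airtel hotspot is up and down", and "Airtel fast network is up and down". The single-word synonyms need positionLength 2 so that they rejoin the graph at "is" (position 3), while "fast" and "network" each cover one position and so keep the default.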