从 url 弹性搜索中删除特殊字符和单词
removing special characters and words from a url elasticsearch
我正在寻找一种方法来从 url.
中生成单词和特殊字符作为标记
例如。我有一个 url https://www.google.com/
我想在 elastic 中生成令牌,如 https、www、google、com、:、/、/、.、.、/
您可以使用 letter
分词器定义自定义分析器,如下所示:
PUT index3
{
"settings": {
"analysis": {
"analyzer": {
"my_email": {
"tokenizer": "letter",
"filter": [
"lowercase"
]
}
}
}
}
}
测试API:
POST index3/_analyze
{
"text": [
"https://www.google.com/"
],
"analyzer": "my_email"
}
输出:
{
"tokens" : [
{
"token" : "https",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "www",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "google",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "com",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 3
}
]
}
我正在寻找一种方法来从 url.
中生成单词和特殊字符作为标记例如。我有一个 url https://www.google.com/
我想在 elastic 中生成令牌,如 https、www、google、com、:、/、/、.、.、/
您可以使用 letter
分词器定义自定义分析器,如下所示:
PUT index3
{
"settings": {
"analysis": {
"analyzer": {
"my_email": {
"tokenizer": "letter",
"filter": [
"lowercase"
]
}
}
}
}
}
测试API:
POST index3/_analyze
{
"text": [
"https://www.google.com/"
],
"analyzer": "my_email"
}
输出:
{
"tokens" : [
{
"token" : "https",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "www",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "google",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "com",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 3
}
]
}