Removing special characters and words from a URL in Elasticsearch

I am looking for a way to generate tokens for the words and special characters in a URL.

For example, I have the URL https://www.google.com/

I would like Elasticsearch to generate tokens such as https, www, google, com, :, /, /, ., ., /

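You can define a custom analyzer with the letter tokenizer, which breaks the text into tokens whenever it encounters a character that is not a letter, so only the words are kept: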

PUT index3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email": {
          "tokenizer": "letter",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

Test it with the analyze API:

POST index3/_analyze
{
  "text": [
    "https://www.google.com/"
  ],
  "analyzer": "my_email"
}

Output:

{
  "tokens" : [
    {
      "token" : "https",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "www",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "google",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "com",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}
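
To see why the output above contains only the four words and none of the :, / or . characters, here is a rough Python sketch of what the analyzer does. It is only an approximation for illustration, not Elasticsearch's implementation: the function name letter_analyze is hypothetical, and the regular expression handles ASCII letters only, whereas the real letter tokenizer works on Unicode letters.

```python
import re


def letter_analyze(text):
    """Approximate a letter tokenizer followed by a lowercase filter.

    Assumption: ASCII-only input. Each maximal run of letters becomes one
    token; every other character acts as a separator and is discarded.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        tokens.append(
            {
                "token": match.group().lower(),  # lowercase filter
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


tokens = letter_analyze("https://www.google.com/")
for t in tokens:
    print(t)
```

For https://www.google.com/ this yields the same tokens and offsets as the _analyze response above. The separators never become tokens, which is why the punctuation listed in the question does not appear.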