如何在索引时停止在内容中存储特殊字符
How to stop storing special characters in content while indexing
这是一个示例文档,包含以下几点:
制药
营销
建筑 -
责任。
马萨诸塞州 - 2020 年 8 月 13 日 -
如何在索引时从内容中删除特殊字符或非 ascii unicode 字符?我正在使用 ES 7.x 和风暴爬虫 1.17
似乎是对字符集的错误检测。您可以通过编写 custom parse filter 并在其中删除不需要的字符来在索引之前规范化内容。
如果编写自定义解析过滤器和规范化对您来说看起来很困难。您只需在分析器定义中添加 asciifolding token filter 即可将 non-ascii 字符转换为它们的 ascii 字符,如下所示
POST http://{{主机名}}:{{端口}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}
并为您的文本生成标记。
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}
这是一个示例文档,包含以下几点: 制药 营销 建筑 - 责任。 马萨诸塞州 - 2020 年 8 月 13 日 -
如何在索引时从内容中删除特殊字符或非 ascii unicode 字符?我正在使用 ES 7.x 和风暴爬虫 1.17
似乎是对字符集的错误检测。您可以通过编写 custom parse filter 并在其中删除不需要的字符来在索引之前规范化内容。
如果编写自定义解析过滤器和规范化对您来说看起来很困难。您只需在分析器定义中添加 asciifolding token filter 即可将 non-ascii 字符转换为它们的 ascii 字符,如下所示
POST http://{{主机名}}:{{端口}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}
并为您的文本生成标记。
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}