How to stop storing special characters in content while indexing

Question

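This is a sample document with the following content: Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ

How can I remove special characters or non-ASCII Unicode characters from the content while indexing? I am using ES 7.x and StormCrawler 1.17.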

Answer 1

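This looks like an incorrect detection of the charset. You could normalise the content before indexing by writing a custom parse filter and removing the unwanted characters there.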
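To illustrate the idea, here is a minimal Python sketch of how this kind of garbage can arise and how it could be stripped. It assumes, as one plausible explanation only, that the page was UTF-8 but was decoded as Latin-1, which turns an en dash into "â" plus two invisible control characters and a non-breaking space into "Â" plus a non-breaking space. The function names `simulate_mojibake` and `strip_non_ascii` are made up for this example and are not part of StormCrawler; in a real crawl the equivalent logic would sit inside a Java parse filter.

```python
import re


def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then decode as Latin-1, mimicking a wrong charset guess."""
    return text.encode("utf-8").decode("latin-1")


def strip_non_ascii(text: str) -> str:
    """Replace each run of non-ASCII characters with a space, then tidy whitespace."""
    ascii_only = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", ascii_only).strip()


if __name__ == "__main__":
    # Hypothetical source text: an en dash (U+2013) and two non-breaking spaces (U+00A0).
    original = "Building \u2013 responsibilities.\u00a0\u00a0Mass."
    garbled = simulate_mojibake(original)
    print(repr(garbled))
    print(strip_non_ascii(garbled))
```

Note that stripping everything outside ASCII also removes legitimate accented letters, so fixing the charset detection at the source is preferable where that is possible.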

Answer 2

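If writing a custom parse filter and normalising the content looks difficult to you, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents, as shown below:

POST http://{{hostname}}:{{port}}/_analyze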

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ"
}

This generates the following tokens for your text:

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
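The request above only tests the filter through the _analyze API. To apply it at index time, the filter has to be part of an analyzer that is assigned to the field. The following Python sketch builds an index-creation body for ES 7.x that does this. The analyzer name `folding_analyzer` and the field name `content` are assumptions chosen for this example, so adjust them to match your own mapping.

```python
import json


def build_index_body(field: str = "content", analyzer: str = "folding_analyzer") -> dict:
    """Return an index-creation body with a custom analyzer that uses asciifolding."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["asciifolding"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # The text field is analysed with the custom analyzer at index time.
                field: {"type": "text", "analyzer": analyzer}
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of the index-creation (PUT) request.
    print(json.dumps(build_index_body(), indent=2))
```

Keep in mind that an analyzer only changes the terms written to the inverted index. The original text stored in _source is left untouched, so if the characters must not be stored at all, they need to be removed before the document reaches Elasticsearch.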

elasticsearch

stormcrawler

elasticsearch-analyzers