ElasticSearch - how to filter hate words / insults in search analysis

I am trying to configure ElasticSearch 7.

I configured some stop words and assumed they would also cover those words, but that does not seem to be the case...

What is the best practice here?

My current settings are as follows:

'analysis' => [
    'filter' => [
        ...
        'english_stop' => [
            'type' => 'stop',
            'stopwords' => '_english_'
        ],
        'english_stemmer' => [
            'type' => 'stemmer',
            'language' => 'english'
        ],
        'english_possessive_stemmer' => [
            'type' => 'stemmer',
            'language' => 'possessive_english'
        ]
        ...
    ],
    'analyzer' => [
        'rebuilt_english' => [
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => [
                ...
                'english_possessive_stemmer',
                'lowercase',
                'english_stop',
                'english_stemmer'
            ]
        ]
    ]
]

Thanks

A) If you want to eliminate results containing bad words, i.e. omit them from search responses entirely, you can add an index alias.

First, create the index as usual:

PUT dirty-index
{
  "settings": {
    "analysis": {
      "filter": { ... },
      "analyzer": { ... }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "rebuilt_english"
      }
    }
  }
}

Add one "safe" document and one "unsafe" document:

POST dirty-index/_doc
{
  "content": "some regular text"
}

POST dirty-index/_doc
{
  "content": "some taboo text with bad words"
}

Save a filtered index alias, thereby creating a safe "view" of the original index:

PUT dirty-index/_alias/dirty-index-filtered
{
  "filter": {
    "bool": {
      "must_not": {
        "terms": {
          "content": ["taboo"]
        }
      }
    }
  }
}

taboo is just one of many bad words, taken from: https://www.cs.cmu.edu/~biglou/resources/bad-words.txt

And voilà: the alias only contains the "safe" documents. Verify it with:

GET dirty-index-filtered/_search
{
  "query": {
    "match_all": {}
  }
}
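Keep in mind that the `terms` filter matches against the *analyzed* tokens of `content`, so documents are excluded based on their indexed terms. For intuition, here is a toy Python simulation of what the filtered alias does (not an Elasticsearch API call; the `analyze` stand-in only lowercases and splits, ignoring stemming and stop words):

```python
# Simulate the alias's must_not/terms filter: a document is visible
# through the "filtered" view only if none of its analyzed tokens
# appear in the bad-word list.
BAD_WORDS = {"taboo"}  # one entry from the linked bad-words list

def analyze(text):
    # crude stand-in for the standard tokenizer + lowercase filter
    return text.lower().split()

def visible_through_alias(doc):
    return not BAD_WORDS.intersection(analyze(doc["content"]))

docs = [
    {"content": "some regular text"},
    {"content": "some taboo text with bad words"},
]

safe = [d for d in docs if visible_through_alias(d)]
# only the "safe" document survives the filtered view
```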

B) If you want to CENSOR select terms before they get indexed, you can do so through an ingest pipeline.

Store the pipeline:

PUT _ingest/pipeline/my_data_cleanser
{
  "description": "Runs a doc thru a censoring replacer...",
  "processors": [
    {
      "script": {
        "source": """
          def bad_words = ['taboo', 'damn'];  // list all of 'em
          def CENSORED = '*CENSORED*';
          def content_copy = ctx.content;
          
          for (word in bad_words) {
            if (content_copy.contains(word)) {
              content_copy = content_copy.replace(word, CENSORED);
            }
          }
          
          ctx.content = content_copy;
        """
      }
    }
  ]
}

Then reference it as a URL parameter when indexing documents:

                     |
                     v________
POST dirty-index/_doc?pipeline=my_data_cleanser
{
  "content": "some text with damn bad words"
}

which will result in:

some text with *CENSORED* bad words

C) If you want to catch and replace select words during the analysis step, you can use a pattern_replace token filter.

PUT dirty-index
{
  "settings": {
    "analysis": {
      "filter": {
        "bad_word_replacer": {
          "type": "pattern_replace",
          "pattern": "((taboo)|(damn))",      <--- not sure how this'll scale to potentially hundreds of words
          "replacement": "*CENSORED*"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "bad_word_replacer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "rebuilt_english"
      }
    }
  }
}
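As to the scaling concern noted next to `pattern`: one option is to generate the alternation programmatically from the word list before creating the index. A minimal Python sketch (the two-word list is a placeholder; in practice you would load the full bad-words file):

```python
import re

bad_words = ["taboo", "damn"]  # placeholder; load the full list in practice

# Build "(taboo|damn)", escaping each word in case it
# contains regex metacharacters.
pattern = "(" + "|".join(re.escape(w) for w in bad_words) + ")"

# The resulting string can be dropped into the filter definition:
bad_word_replacer = {
    "type": "pattern_replace",
    "pattern": pattern,
    "replacement": "*CENSORED*",
}
```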

Note that this only affects the analyzed fields but NOT the stored values:

POST dirty-index/_analyze?filter_path=tokens.token&format=yaml
{
  "field": "content",
  "text": ["some taboo text"]
}

The resulting tokens will be:

tokens:
- token: "some"
- token: "*CENSORED*"
- token: "text"

But they won't be of much use because, if I understand your use case correctly, you don't want to prevent searching for hate words; you want to prevent retrieving them?