ElasticSearch - how to filter hate words / insults in search analysis
I'm trying to configure Elasticsearch 7.
I configured some stop words and assumed they would also cover those words, but that doesn't seem to be the case...
What's the best practice here?
My current settings look like this:
'analysis' => [
'filter' => [
...
'english_stop' => [
'type' => 'stop',
'stopwords' => '_english_'
],
'english_stemmer' => [
'type' => 'stemmer',
'language' => 'english'
],
'english_possessive_stemmer' => [
'type' => 'stemmer',
'language' => 'possessive_english'
]
...
],
'analyzer' => [
'rebuilt_english' => [
'type' => 'custom',
'tokenizer' => 'standard',
'filter' => [
...
'english_possessive_stemmer',
'lowercase',
'english_stop',
'english_stemmer'
]
]
]
]
Thanks
A) If you want to eliminate results containing bad words, i.e. leave them out of search responses entirely, you can add a filtered index alias.
First, create the index as usual:
PUT dirty-index
{
"settings": {
"analysis": {
"filter": { ... },
"analyzer": { ... }
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "rebuilt_english"
}
}
}
}
Index one "safe" document and one "unsafe" document:
POST dirty-index/_doc
{
"content": "some regular text"
}
POST dirty-index/_doc
{
"content": "some taboo text with bad words"
}
Save a filtered index alias, thereby creating a safe "view" of the original index:
PUT dirty-index/_alias/dirty-index-filtered
{
"filter": {
"bool": {
"must_not": {
"terms": {
"content": ["taboo"]
}
}
}
}
}
(taboo is just one of many bad words, taken from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt.)
Voilà, the alias only contains the "safe" document. Verify with:
GET dirty-index-filtered/_search
{
"query": {
"match_all": {}
}
}
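One caveat worth keeping in mind: the alias filter matches against indexed terms, so with the rebuilt_english analyzer the words in the terms list need to be in their analyzed (lowercased, stemmed) form. If unsure what a given word is indexed as, you can check it with the standard _analyze API:

GET dirty-index/_analyze
{
  "analyzer": "rebuilt_english",
  "text": "taboo"
}

The tokens in the response are what the terms filter would need to match.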
B) If you want to CENSOR selected terms before indexing, you can do so through an ingest pipeline.
Store the pipeline:
PUT _ingest/pipeline/my_data_cleanser
{
"description": "Runs a doc thru a censoring replacer...",
"processors": [
{
"script": {
"source": """
def bad_words = ['taboo', 'damn']; // list all of 'em
def CENSORED = '*CENSORED*';
def content_copy = ctx.content;
for (word in bad_words) {
if (content_copy.contains(word)) {
content_copy = content_copy.replace(word, CENSORED)
}
}
ctx.content = content_copy;
"""
}
}
]
}
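Before wiring the pipeline into your indexing calls, you can dry-run it with the standard _simulate API to confirm the script behaves as expected:

POST _ingest/pipeline/my_data_cleanser/_simulate
{
  "docs": [
    { "_source": { "content": "some text with damn bad words" } }
  ]
}

The simulated docs in the response should show the content with damn replaced by *CENSORED*.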
Then reference it as a URL parameter when indexing documents:
POST dirty-index/_doc?pipeline=my_data_cleanser
{
"content": "some text with damn bad words"
}
Which results in:
some text with *CENSORED* bad words
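If you'd rather not pass the parameter on every request, you can also attach the pipeline to the index through the index.default_pipeline setting, which makes Elasticsearch run it for every indexing operation automatically:

PUT dirty-index/_settings
{
  "index.default_pipeline": "my_data_cleanser"
}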
C) If you want to catch and replace selected words during the analysis step, you can use a pattern_replace token filter.
PUT dirty-index
{
"settings": {
"analysis": {
"filter": {
"bad_word_replacer": {
"type": "pattern_replace",
"pattern": "((taboo)|(damn))", <--- not sure how this'll scale to potentially hundreds of words
"replacement": "*CENSORED*"
}
},
"analyzer": {
"rebuilt_english": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"bad_word_replacer"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "rebuilt_english"
}
}
}
}
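As for scaling the pattern to hundreds of words: an alternative worth considering is a synonym token filter that maps each bad word to the censored token and loads its rules from a file on each node (the path below is an assumption; it's resolved relative to the Elasticsearch config directory):

"bad_word_replacer": {
  "type": "synonym",
  "synonyms_path": "analysis/bad-words.txt"
}

with analysis/bad-words.txt containing one rule per line, e.g.:

taboo => CENSORED
damn => CENSORED

That keeps the word list out of the index settings and avoids maintaining a giant regex.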
Note that this only affects the analyzed fields, but NOT the stored values:
POST dirty-index/_analyze?filter_path=tokens.token&format=yaml
{
"field": "content",
"text": ["some taboo text"]
}
The resulting tokens will be:
tokens:
- token: "some"
- token: "*CENSORED*"
- token: "text"
But they won't be of much use because, if I understand your use case correctly, you don't need to disable searching for hate words, but rather their retrieval?