Ignore filtered words from the query string when using phrase match in Elasticsearch

Question

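I am using a custom index analyzer to remove a specific set of stop words. I then run a phrase match query with text that contains some of those stop words. I expected the stop words to be filtered out of the query, but they are not, and any document that does not contain them is excluded from the results.

Here is a simplified example of what I am trying to do: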

    #!/bin/bash
    
    export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
    
    # Create index, with a custom analyzer to filter out the word 'foo'
    curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
        "settings": {
            "analysis": {
                "analyzer": {
                    "fooAnalyzer": {
                        "type": "custom",
                        "tokenizer": "letter",
                        "filter": [
                            "fooFilter"
                        ]
                    }
                },
                "filter": {
                    "fooFilter": {
                        "type": "stop",
                        "stopwords": [
                            "foo"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "myDocument": {
                "properties": {
                    "myMessage": {
                        "analyzer": "fooAnalyzer",
                        "type": "string"
                    }
                }
            }
        }
    }'
    
    # Add sample document
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
    {"index":{"_index":"play","_type":"myDocument"}}
    {"myMessage":"bar baz"}
    '

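If I run a phrase match search against this index with a filtered stop word in the middle of the query, I expect it to match, since 'foo' should be filtered out by the analyzer.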

    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "query": {
            "match": {
                "myMessage": {
                    "type": "phrase",
                    "query": "bar foo baz"
                }
            }
        }
    }
    '

However, I get no results.

Is there a way to instruct Elasticsearch to tokenize and filter the query string before executing the search?

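Edit 1: Now I am even more confused. Earlier I saw that phrase matching fails when my query contains a stop word in the middle of the query text. Now I also find that the phrase query fails when the document contains a stop word between the words of the query text. Here is a minimal example, still using the mapping above.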

    # Remember: 'foo' is a stopword and is filtered out during analysis
    POST play/myDocument
    {
      "myMessage": "fib foo bar"
    }

    GET play/_search
    {
        "query": {
            "match": {
                "myMessage": {
                    "type": "phrase",
                    "query": "fib bar"
                }
            }
        }
    }

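This query does not match. I am very surprised by this! I expected the stop word foo to be filtered out and ignored.

For an example of why I expected this, see this query: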

    POST play/myDocument
    {
      "myMessage": "fib 123 bar"
    }

    GET play/_search
    {
        "query": {
            "match": {
                "myMessage": {
                    "type": "phrase",
                    "query": "fib bar"
                }
            }
        }
    }

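This matches, because '123' is filtered out by my 'letter' tokenizer. It seems that phrase matching ignores stop word filtering entirely and behaves as if those tokens were still in the analyzed field (even though they do not appear in the token list returned by _analyze).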

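My current best workaround:

Call the _analyze endpoint with my custom analyzer against my document's text string. This returns the tokens from the original text string, with the pesky stop words removed for me
Save a version of my text built only from those tokens into a "filtered" field in the document

Later, at query time:

Call the _analyze endpoint with my custom analyzer against my query string to get only the tokens

Run a phrase match query with the filtered token string against the document's new "filtered" field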
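The steps above can be sketched in Python as a few pure helpers. This is only an illustration under stated assumptions: the helper names are hypothetical, the JSON form of the _analyze request body may differ on older Elasticsearch versions, and `sample_response` is a hand-written stand-in shaped like an _analyze response, not real output. No HTTP calls are made here.

```python
import json


def analyze_request_body(analyzer, text):
    """Body for a POST to <index>/_analyze (assumed JSON form)."""
    return {"analyzer": analyzer, "text": text}


def filtered_text(analyze_response):
    """Join the surviving tokens, in position order, into one string."""
    tokens = sorted(analyze_response["tokens"], key=lambda t: t["position"])
    return " ".join(t["token"] for t in tokens)


def phrase_query_body(field, text):
    """Phrase match in the same (older) syntax the question uses."""
    return {"query": {"match": {field: {"type": "phrase", "query": text}}}}


# Hand-written sample shaped like an _analyze response for "fib foo bar"
# (assumption: 'foo' was removed and left a gap at position 1).
sample_response = {
    "tokens": [
        {"token": "fib", "start_offset": 0, "end_offset": 3,
         "type": "word", "position": 0},
        {"token": "bar", "start_offset": 8, "end_offset": 11,
         "type": "word", "position": 2},
    ]
}

text = filtered_text(sample_response)
print(json.dumps(analyze_request_body("fooAnalyzer", "fib foo bar")))
print(json.dumps(phrase_query_body("filtered", text)))
```

The same `filtered_text` helper would be used twice: once at index time to populate the "filtered" field, and once at query time before building the phrase query.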

Answer 1

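A workaround that should work:

Call the _analyze endpoint with my custom analyzer against my query string. This returns the tokens from the original query string, with the pesky stop words removed for me
Run the phrase match query using the filtered tokens

However, this obviously requires two calls to Elasticsearch for every query. I would like to find a better solution if possible.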

Answer 2

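It turns out that token filters come too late in the analysis chain to remove unwanted words if you want to use phrase matching. By that point, the position field of your important tokens has already been polluted by the presence of the filtered tokens, and phrase matching refuses to work.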
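To make the position problem concrete, here is a small Python simulation. It is a simplified model, not Elasticsearch code: the function names are hypothetical, the tokenizer only handles ASCII letters, and the phrase check is a strict (slop 0) comparison of relative positions.

```python
import re

STOPWORDS = {"foo"}


def letter_tokenize(text):
    """Split on non-letters and number the tokens 0, 1, 2, ..."""
    return list(enumerate(re.findall(r"[A-Za-z]+", text)))


def stop_filter(tokens, stopwords=STOPWORDS):
    """Drop stop words but keep the original positions, leaving gaps."""
    return [(pos, tok) for pos, tok in tokens if tok not in stopwords]


def analyze(text):
    return stop_filter(letter_tokenize(text))


def phrase_matches(doc_text, query_text):
    """True if the query tokens occur in the doc with the same relative positions."""
    doc = analyze(doc_text)
    query = analyze(query_text)
    if not query:
        return False
    doc_index = set(doc)
    first_pos, first_tok = query[0]
    for pos, tok in doc:
        if tok != first_tok:
            continue
        if all((pos + (qpos - first_pos), qtok) in doc_index
               for qpos, qtok in query):
            return True
    return False


print(analyze("bar foo baz"))                     # gap where 'foo' used to be
print(phrase_matches("bar baz", "bar foo baz"))
print(phrase_matches("fib foo bar", "fib bar"))
print(phrase_matches("fib 123 bar", "fib bar"))
```

In this model the removed 'foo' leaves a hole in the positions, so the two sides of the comparison disagree on the distance between the neighbouring words, while '123' never receives a position at all because the tokenizer does not emit it.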

The answer: filter before we reach the token filter level. I created a char_filter that removes the unwanted terms, and phrase matching started working correctly!

    PUT play 
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "fooAnalyzer": {
                        "type": "custom",
                        "tokenizer": "letter",
                        "char_filter": [
                            "fooFilter"
                        ]
                    }
                },
                "char_filter": {
                    "fooFilter": {
                        "type": "pattern_replace",
                        "pattern": "(foo)",
                        "replacement": ""
                    }
                }
            }
        },
        "mappings": {
            "myDocument": {
                "properties": {
                    "myMessage": {
                        "analyzer": "fooAnalyzer",
                        "type": "string"
                    }
                }
            }
        }
    }

Queries:

    POST play/myDocument
    {
      "myMessage": "fib bar"
    }
    
    GET play/_search
    {
        "query": {
            "match": {
                "myMessage": {
                    "type": "phrase",
                    "query": "fib foo bar"
                }
            }
        }
    }

and

    POST play/myDocument
    {
      "myMessage": "fib foo bar"
    }
    
    GET play/_search
    {
        "query": {
            "match": {
                "myMessage": {
                    "type": "phrase",
                    "query": "fib bar"
                }
            }
        }
    }

Both now work!
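A rough Python model of why the char_filter approach behaves differently: the text is rewritten before tokenization, so the tokenizer never sees 'foo' and assigns consecutive positions. The function names are hypothetical and the tokenizer is simplified to ASCII letters. One thing worth noting is that the pattern `(foo)` is not anchored to word boundaries, so it also strips 'foo' out of longer words; the `\bfoo\b` variant shown here is an untested suggestion, not part of the original answer.

```python
import re


def char_filter(text, pattern=r"(foo)"):
    """Mimic a pattern_replace char filter: rewrite the raw text first."""
    return re.sub(pattern, "", text)


def analyze(text, pattern=r"(foo)"):
    """Char filter, then a letter-style tokenizer with consecutive positions."""
    cleaned = char_filter(text, pattern)
    return list(enumerate(re.findall(r"[A-Za-z]+", cleaned)))


def phrase_matches(doc_text, query_text, pattern=r"(foo)"):
    """Strict phrase check: query tokens must be consecutive in the doc."""
    doc = set(analyze(doc_text, pattern))
    query = analyze(query_text, pattern)
    if not query:
        return False
    return any(all((start + i, tok) in doc for i, tok in query)
               for start, _ in doc)


print(phrase_matches("fib bar", "fib foo bar"))
print(phrase_matches("fib foo bar", "fib bar"))

# Caveat: the unanchored pattern also eats 'foo' inside 'food'.
print(analyze("food bar"))
print(analyze("food bar", pattern=r"\bfoo\b"))
```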

Answer 3

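Solution

This is an alternative solution for a similar problem, but one that removes English stop words and handles multi-value fields; tested on v7.10. It does not require an explicit char_filter. It uses the standard analyzer with English stop words and maps the field as text, so it should handle match_phrase properly: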

    PUT play
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "phrase_analyzer": {
                        "type": "standard",
                        "stopwords": "_english_" //for my use case
                    }
                }
            }
        },
        "mappings": {
            // "myDocument" is not used in v7.x
            "properties": {
                "myMessage": {
                    "analyzer": "phrase_analyzer",
                    "type": "text" //changed to handle match_phrase
                }
            }
        }
    }

For this demo data:

    POST _bulk
    { "index": { "_index": "play", "_id": "1" } }
    { "myMessage": ["Guardian of the Galaxy"]}
    { "index": { "_index": "play", "_id": "2" } }
    { "myMessage": ["Ambassador of Peace", "Guardian of the Galaxy"]}
    { "index": { "_index": "play", "_id": "3" } }
    { "myMessage": ["Guardian of the Galaxy and Ambassador of Peace"]}
    { "index": { "_index": "play", "_id": "4" } }
    { "myMessage": ["Ambassador of Peace and Guardian of the Galaxy"]}
    { "index": { "_index": "play", "_id": "5" } }
    { "myMessage": ["Supreme Galaxy and All Living Beings Guardian"]}
    { "index": { "_index": "play", "_id": "6" } }
    { "myMessage": ["Guardian of the Sun", "Worker of the Galaxy"]}

Query 1:

    GET play/_search
    {
        "query": {
            "match_phrase": {
                "myMessage": {
                    "query": "guardian of the galaxy",
                    "slop": 99 //useful on multi-values text fields
                    //https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
                }
            }
        }
    }

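Should return documents 1 to 5, because each has at least one value that contains both "guardian" and "galaxy". Document 6 will not match, because each of these words appears in a different value rather than in the same one (that is why we use slop=99).

Query 2: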


    GET play/_search
    {
        "query": {
            "match_phrase": {
                "myMessage": {
                    "query": "\"guardian of the galaxy\"",
                    "slop": 99
                }
            }
        }
    }

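Should return only documents 1 to 4, because the (escaped) double quotes force an exact substring match within each value, while document 5 has the two words in different positions.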

Explanation

The problem occurs because you used a stop token filter [1], and:

"Token filters are not allowed to change the position or character offsets of each token."

You combined it with a match_phrase query, but [2]:

"The match_phrase query analyzes the text and creates a phrase query out of the analyzed text."

So the positions were already computed before the stop token filter was applied, and match_phrase relies on them to compute matches. '123' works because the letter tokenizer is what actually assigns the positions [1], so match_phrase is happy!

"The tokenizer is also responsible for recording the order or position of each term."

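Exceptions - 0.3% were false positives

After testing this solution against a larger variety of data, I found some anomalous false positives: about 0.3% of 4k search results. In my particular case, I use match_phrase inside a filter. To reproduce a false positive, simply swap the order of the values in document 6 so that the words "Galaxy" and "Guardian" end up close to each other: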

    POST _bulk
    { "index": { "_index": "play", "_id": "7" } }
    { "myMessage": ["Worker of the Galaxy", "Guardian of the Sun"]}

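The earlier query 1 also returns this document, but it clearly should not. I could not solve this through the Elasticsearch API, but I did solve it by programmatically removing the stop words from query 1 (see the next query).

Query 3: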

    GET play/_search
    {
        "query": {
            "match_phrase": {
                "myMessage": {
                    "query": "guardian galaxy", //manually removed "of" and "the" stop words
                    "slop": 99 //useful on multi-values text fields
                    //https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
                }
            }
        }
    }
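Removing the stop words programmatically, as in query 3, could look like the following Python sketch. The helper names are hypothetical, and the stop word set is an assumption: it is meant to mirror the default English list behind `_english_`, so it is worth checking against the stop word documentation for your version.

```python
import json
import re

# Assumed to mirror the default English stop word list used by `_english_`.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def strip_stopwords(query, stopwords=ENGLISH_STOPWORDS):
    """Lowercase the query, split it into words, and drop the stop words."""
    words = re.findall(r"\w+", query.lower())
    return " ".join(w for w in words if w not in stopwords)


def match_phrase_body(field, query, slop=99):
    """Build a match_phrase request body from the cleaned query text."""
    return {
        "query": {
            "match_phrase": {
                field: {"query": strip_stopwords(query), "slop": slop}
            }
        }
    }


print(json.dumps(match_phrase_body("myMessage", "guardian of the galaxy"), indent=2))
```

The resulting body can then be sent to `play/_search` with any HTTP client.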

See more at:

[1] Tokenizers versus token filters: Anatomy of an analyzer
[2] Match phrase query
[3] Tokenizers and position calculation: Token graphs
