ElasticSearch query matching incorrectly

Question

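I'm trying to match a query based on a URL field. I have an InsertLink method (below) that fires when someone adds a new link on the web page. Right now, if any link is added with the prefix "https://" or "http://", it automatically matches the first (in this case the only) item in the index that has an "https://" or "http://" prefix. Is this because my model is set up with the Uri type? Here is an example of my model and a screenshot of debugging the InsertLink method.

My model: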

public class SSOLink
{
    public string Name { get; set; }
    public Uri Url { get; set; }
    public string Owner { get; set; }

}

Example screenshot.

Answer 1

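You need to use the UAX_URL tokenizer to search on URL fields.

You can create a custom analyzer with the UAX_URL tokenizer (`uax_url_email`) and keep using the same match query you are using now to get the expected result.

Index mapping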

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "uax_url_email"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "url": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
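
For reference, here is a minimal Python sketch of the kind of match query this mapping is meant to serve. The question does not show the actual query, so the index name, the cluster address, and both helper functions are assumptions for illustration only; the request goes to the standard `_search` endpoint.

```python
import json
import urllib.request


def build_match_query(url):
    """Return a match query body against the `url` field from the mapping."""
    return {"query": {"match": {"url": url}}}


def search_link(base_url, index, url):
    """POST the query to <base_url>/<index>/_search and return the hits list.

    base_url and index are hypothetical; adjust them to your cluster.
    """
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(build_match_query(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["hits"]["hits"]


if __name__ == "__main__":
    # Only prints the request body; search_link(...) needs a live cluster.
    print(json.dumps(build_match_query("https://www.microsoft.com"), indent=2))
```

With the custom analyzer on `url`, the query string is analyzed the same way as the indexed value, so an empty hits list from something like `search_link("http://localhost:9200", "sso_links", new_url)` would mean the link is not in the index yet.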

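It looks like in your case the URL field is mapped as a text field in Elasticsearch, which uses the standard analyzer by default. You can use the _analyze API to check how your URL field is tokenized.

Using the standard analyzer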

POST _analyze/

{
    "text": "https://www.microsoft.com",
    "analyzer" : "standard"
}

Tokens

{
    "tokens": [
        {
            "token": "https",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "www.microsoft.com",
            "start_offset": 8,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
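
These two tokens explain the false match. A match query analyzes the query string as well and, by default, combines the resulting terms with OR, so any two links that share the `https` term match each other. The Python sketch below imitates that behavior; the regex is only a rough stand-in for the standard analyzer (an assumption, not the real implementation), and the second URL is a made-up example.

```python
import re


def standard_like_tokens(text):
    # Rough approximation of the standard analyzer for URLs: lowercase,
    # then keep runs of letters/digits, allowing dots between them.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def match_any(query, document):
    # Default match semantics (operator OR): one shared term is enough.
    return bool(set(standard_like_tokens(query)) & set(standard_like_tokens(document)))


indexed = "https://www.microsoft.com"
new_link = "https://www.example.org"  # hypothetical link being inserted

print(standard_like_tokens(indexed))  # ['https', 'www.microsoft.com']
print(match_any(new_link, indexed))   # True, because both contain 'https'
```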

Using the UAX_URL tokenizer

POST _analyze/

{
    "text": "https://www.microsoft.com",
    "tokenizer" : "uax_url_email"
}

and the generated token is

{
    "tokens": [
        {
            "token": "https://www.microsoft.com",
            "start_offset": 0,
            "end_offset": 25,
            "type": "<URL>",
            "position": 0
        }
    ]
}
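
Because the whole URL is now a single term, two different links no longer share anything just by having the same scheme. A small Python sketch of that effect follows; the regex is a crude stand-in for `uax_url_email` (an assumption for illustration), and the second URL is made up. Case is left untouched because an analyzer that only sets a tokenizer applies no lowercase filter.

```python
import re

# Crude stand-in for uax_url_email: a scheme://... run stays one token,
# anything else falls back to plain alphanumeric words.
URL_OR_WORD = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+|[A-Za-z0-9]+")


def url_aware_tokens(text):
    return URL_OR_WORD.findall(text)


def match_any(query, document):
    # Same OR semantics as a default match query.
    return bool(set(url_aware_tokens(query)) & set(url_aware_tokens(document)))


indexed = "https://www.microsoft.com"

print(url_aware_tokens(indexed))                      # ['https://www.microsoft.com']
print(match_any("https://www.example.org", indexed))  # False
print(match_any("https://www.microsoft.com", indexed))  # True
```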
