How to construct an Elasticsearch query to filter only URLs with a subdomain?

I store a URL as a field in Elasticsearch. However, I only want to match documents whose URL contains a subdomain.

For example, I want my search results to include

http://any-subdomain.example.com

but I do not want them to include

https://www.example.com

Is this possible with an Elasticsearch query?

Have you tried a query_string query? For example, this is what I use on Twitter data:

GET /twitter2/tweet/_search
{
    "query": {
        "query_string": {
            "default_field": "entities.media.url",
            "query": "https\\:\\/\\/t.co\\/* AND -https\\:\\/\\/t.co\\/6*"
        }
    },
    "_source": ["entities.media.url"]
}

My mapping for this search (the field is not analyzed, so each URL is indexed as a single term):

PUT /twitter2/tweet/_mapping
{
    "properties": {
        "entities": {
            "properties": {
                "media": {
                    "properties": {
                        "url": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}

For your case, you can use the following query:

GET /your-index/your-type/_search
{
    "query": {
        "query_string": {
            "default_field": "url",
            "query": "http\\:\\/\\/*.example.com AND -http\\:\\/\\/www.example.com"
        }
    }
}
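
If you build the request body in client code, the escaping is easy to get wrong, so here is a minimal Python sketch of one way to generate it. The helper names (escape_query_string, build_subdomain_query) are hypothetical, and the list of escaped characters is my reading of the reserved characters for query_string; the wildcards * and ? are deliberately left unescaped.

```python
import json
import re

# Characters that query_string treats as operators and that need a
# backslash to be matched literally. "*" and "?" are left out on purpose
# so they keep working as wildcards.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~:\\/])')


def escape_query_string(text):
    """Backslash-escape query_string reserved characters, keeping wildcards."""
    return _RESERVED.sub(r"\\\1", text)


def build_subdomain_query(field, domain, excluded_host="www"):
    """Build a search body matching any subdomain of `domain` except one."""
    wanted = escape_query_string("http://*." + domain)
    unwanted = escape_query_string("http://" + excluded_host + "." + domain)
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": wanted + " AND -" + unwanted,
            }
        }
    }


body = build_subdomain_query("url", "example.com")
# json.dumps doubles each backslash, which is what the JSON body needs.
print(json.dumps(body, indent=4))
```

Like the query above, this sketch only covers http:// URLs; https:// URLs would need a second clause or a wildcard in place of the scheme.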

Note: you can get results faster if you split the URL into separate fields (for example url and host) while indexing, so that no wildcard matching is needed at search time. With Elasticsearch 5.x, you can use an ingest node to transform your documents this way. I will try to create a pipeline for this, but you can check the documentation for more information.
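
As a rough illustration of the index-time idea (done in client code here, not with an ingest pipeline), the following Python sketch derives the extra fields before a document is sent to Elasticsearch. The function name enrich_url and the field names host, subdomain and has_subdomain are assumptions for this example, and the base domain is passed in by the caller instead of being detected.

```python
from urllib.parse import urlsplit


def enrich_url(url, base_domain):
    """Return the extra fields to index alongside `url`.

    `base_domain` is supplied by the caller (for example "example.com");
    this sketch does not try to detect it from the URL.
    """
    host = (urlsplit(url).hostname or "").lower()
    suffix = "." + base_domain.lower()
    # Whatever precedes ".example.com" is the subdomain label(s).
    subdomain = host[: -len(suffix)] if host.endswith(suffix) else ""
    return {
        "url": url,
        "host": host,
        "subdomain": subdomain,
        # "www" is treated as "no real subdomain", matching the question.
        "has_subdomain": subdomain not in ("", "www"),
    }


doc = enrich_url("http://any-subdomain.example.com", "example.com")
print(doc)
```

With a boolean field like has_subdomain indexed, the search can become a plain term filter on that field, which avoids the wildcard scan entirely.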