How to query a URL in Azure Search

Question

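I have documents stored in Azure Search with the following fields, all of which are searchable.

url (example: https://example.com/test.html or http://www.example.com/doc/doc1.html)
title
content

Following the official Search Documents documentation, I tried to query by url for documents whose content contains the keyword hotel, but it failed.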

POST /indexes/hotels/docs/search?api-version=2017-11-11  
{  
  "search": "url:example.com AND hotel",  
  "searchMode": "all"  
}

Update:

I tried the standard tokenizer, and the domain blog.xuite.net was parsed into a single token.

 "tokens": [
    {
        "token": "https",
        "startOffset": 0,
        "endOffset": 5,
        "position": 0
    },
    {
        "token": "blog.xuite.net",
        "startOffset": 8,
        "endOffset": 22,
        "position": 1
    },
    {
        "token": "yundestiny",
        "startOffset": 23,
        "endOffset": 33,
        "position": 2
    },
    {
        "token": "20050916",
        "startOffset": 34,
        "endOffset": 42,
        "position": 3
    }
 ]

Why can't I search with url:blog.xuite.net?

Answer 1

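One thing you might try is applying a custom analyzer to the field that holds this content. I actually think the uax_url_email tokenizer fits your case well, but another option is to create an analyzer that uses char filters so the URL is split on characters such as // and /.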
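
As a rough illustration of the first suggestion, the sketch below builds a custom analyzer definition around the uax_url_email tokenizer, plus a request body for the Analyze API so the resulting tokens can be inspected before reindexing. The analyzer name url_analyzer and the helper function names are hypothetical. Note that uax_url_email is documented as keeping URLs and e-mail addresses as single tokens, so it is worth checking the Analyze output to see whether it matches by domain the way you need.

```python
import json


def url_analyzer_definition(name: str = "url_analyzer") -> dict:
    """Custom analyzer entry for the index's "analyzers" array.

    The name is a placeholder; "uax_url_email" is the tokenizer
    suggested in the answer above.
    """
    return {
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "name": name,
        "tokenizer": "uax_url_email",
        "tokenFilters": [],
        "charFilters": [],
    }


def analyze_request_body(text: str, analyzer: str) -> dict:
    """Body for POST /indexes/{index}/analyze, used to preview tokens."""
    return {"text": text, "analyzer": analyzer}


definition = url_analyzer_definition()
preview = analyze_request_body("https://example.com/test.html", definition["name"])

print(json.dumps(definition, indent=2))
print(json.dumps(preview, indent=2))
```

The url field then has to reference the analyzer by name in its field definition, which normally means creating the field anew and reindexing.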

Answer 2

In the end, I solved it with a CustomAnalyzer that uses tokenizer = standard_v2 and tokenFilters = a limit token filter. Here are my index settings.

"analyzers": [
    {
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "name": "domain_analyzer",
        "tokenizer": "standard_v2",
        "tokenFilters": [
            "my_limit"
        ],
        "charFilters": []
    }
],
"tokenizers": [],
"tokenFilters": [
    {
        "@odata.type": "#Microsoft.Azure.Search.LimitTokenFilter",
        "name": "my_limit",
        "maxTokenCount": 2,
        "consumeAllTokens": false
    }
],

With this CustomAnalyzer, a url field value such as

https://example.com/test.html

is indexed only as its first two tokens, https and example.com, so the path is dropped.
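
To make the effect of maxTokenCount = 2 concrete, here is a small Python simulation of what a limit token filter does to the token stream. It is not the service's implementation; the function name limit_tokens is hypothetical, and the input tokens are the ones the standard tokenizer produced in the question.

```python
from itertools import islice


def limit_tokens(tokens, max_token_count):
    """Keep only the first max_token_count tokens and discard the rest."""
    return list(islice(tokens, max_token_count))


# Tokens reported by the standard tokenizer in the question's update.
tokens = ["https", "blog.xuite.net", "yundestiny", "20050916"]

indexed = limit_tokens(tokens, 2)
print(indexed)  # ['https', 'blog.xuite.net']
```

Only the scheme and the domain survive, which is why a query on the domain matches while path segments are no longer searchable in this field.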

So I can search with search=url:(example.com) AND {keyword}.
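
For completeness, a minimal sketch of building the request body for such a query in Python. The helper name build_domain_query is hypothetical. Fielded search (field:term) belongs to the full Lucene query syntax, so the body sets queryType to full; the body in the question did not set it.

```python
import json


def build_domain_query(domain: str, keyword: str) -> dict:
    """Search Documents body: domain must match in url, keyword anywhere."""
    return {
        "search": f"url:({domain}) AND {keyword}",
        "queryType": "full",   # fielded search needs full Lucene syntax
        "searchMode": "all",
    }


body = build_domain_query("example.com", "hotel")
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of the same POST /indexes/hotels/docs/search request shown in the question.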


azure-cognitive-search