How can I ensure language analysis is applied to tokens generated by WordDelimiterTokenFilter?

This is a follow-up problem I ran into after fixing FEMMES.COM not being tokenized correctly ().

Failing test case: #FEMMES2017 should be tokenized as Femmes, Femme, 2017.

Most likely my use of MappingCharFilter is incorrect and is really just a band-aid. What is the correct way to make this failing test case pass?

Current index configuration

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "text_synonym_token_filter",
        "asciifolding",
        "language_word_delim_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_exact_search_Index_custom_analyzer",
      "tokenizer": "text_exact_search_Index_custom_analyzer_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding"
      ],
      "charFilters": []
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "text_language_search_custom_analyzer_ms_tokenizer",
      "maxTokenLength": 300,
      "isSearchTokenizer": false,
      "language": "french"
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.StandardTokenizerV2",
      "name": "text_exact_search_Index_custom_analyzer_tokenizer",
      "maxTokenLength": 300
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete"
      ],
      "ignoreCase": true,
      "expand": true
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "name": "language_word_delim_token_filter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "catenateWords": false,
      "catenateNumbers": false,
      "catenateAll": false,
      "splitOnCaseChange": true,
      "preserveOriginal": false,
      "splitOnNumerics": true,
      "stemEnglishPossessive": true,
      "protectedWords": []
    }
  ],
  "charFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "name": "replace_punctuation_with_comma",
      "mappings": [
        "#=>,",
        "$=>,",
        "€=>,",
        "£=>,",
        "%=>,",
        "&=>,",
        "+=>,",
        "/=>,",
        "==>,",
        "<=>,",
        ">=>,",
        "@=>,",
        "_=>,",
        "µ=>,",
        "§=>,",
        "¤=>,",
        "°=>,",
        "!=>,",
        "?=>,",
        "\"=>,",
        "'=>,",
        "`=>,",
        "~=>,",
        "^=>,",
        ".=>,",
        ":=>,",
        ";=>,",
        "(=>,",
        ")=>,",
        "[=>,",
        "]=>,",
        "{=>,",
        "}=>,",
        "*=>,",
        "-=>,"
      ]
    }
  ]

Analyze API call

{
  "analyzer": "text_language_search_custom_analyzer",
  "text": "#femmes2017"
}
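
For reference, the request body above can be sent to the Analyze API with a plain HTTP call. The service name, index name, and api-key below are placeholders; the api-version matches the 2016-09-01 version visible in the response's @odata.context:

```
curl -X POST "https://<service>.search.windows.net/indexes/<index>/analyze?api-version=2016-09-01" \
  -H "Content-Type: application/json" \
  -H "api-key: <admin-api-key>" \
  -d '{"analyzer": "text_language_search_custom_analyzer", "text": "#femmes2017"}'
```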

Analyze API response

{
  "@odata.context": "https://one-adscope-search-eu-prod.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "femmes",
      "startOffset": 1,
      "endOffset": 7,
      "position": 0
    },
    {
      "token": "2017",
      "startOffset": 7,
      "endOffset": 11,
      "position": 1
    }
  ]
}

The input text is processed by the analyzer's components in order: char filters -> tokenizer -> token filters. In your case, the tokenizer performs lemmatization before the WordDelimiterTokenFilter ever sees the tokens, so the word parts it splits off (femmes) are never lemmatized to femme. Unfortunately, the Microsoft stemmers and lemmatizers are not available as standalone token filters that you could apply after the WordDelimiterTokenFilter.

You would need to add another token filter that normalizes the output of the WordDelimiterTokenFilter according to your requirements. If this is the only case you care about, you could move the SynonymTokenFilter to the end of the analyzer chain and map femmes to femme. That is admittedly not a great workaround, since it is very specific to the data you are processing, but hopefully this information helps you find a more general solution.
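
A minimal sketch of that workaround, assuming the only missing lemma is femme: the synonym filter is moved after language_word_delim_token_filter, and a femmes => femmes,femme rule (same Solr synonym syntax as the existing rules) is added so both the plural and the lemma are emitted. Everything else stays as in the original configuration:

```
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding",
        "language_word_delim_token_filter",
        "text_synonym_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete",
        "femmes => femmes,femme"
      ],
      "ignoreCase": true,
      "expand": true
    }
  ]
```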