How can I ensure language analysis is applied to a token generated by WordDelimiterTokenFilter
This is a new problem I ran into after fixing FEMMES.COM not being tokenized correctly ().
Failing test case: #FEMMES2017 should tokenize to Femmes, Femme, 2017.
Most likely my use of MappingCharFilter is incorrect and really just a band-aid. What is the correct way to make this failing test case pass?
Current index configuration
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "text_language_search_custom_analyzer",
"tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
"tokenFilters": [
"lowercase",
"text_synonym_token_filter",
"asciifolding",
"language_word_delim_token_filter"
],
"charFilters": [
"html_strip",
"replace_punctuation_with_comma"
]
},
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "text_exact_search_Index_custom_analyzer",
"tokenizer": "text_exact_search_Index_custom_analyzer_tokenizer",
"tokenFilters": [
"lowercase",
"asciifolding"
],
"charFilters": []
}
],
"tokenizers": [
{
"@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"name": "text_language_search_custom_analyzer_ms_tokenizer",
"maxTokenLength": 300,
"isSearchTokenizer": false,
"language": "french"
},
{
"@odata.type": "#Microsoft.Azure.Search.StandardTokenizerV2",
"name": "text_exact_search_Index_custom_analyzer_tokenizer",
"maxTokenLength": 300
}
],
"tokenFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
"name": "text_synonym_token_filter",
"synonyms": [
"ca => ça",
"yeux => oeil",
"oeufs,oeuf,Œuf,Œufs,œuf,œufs",
"etre,ete"
],
"ignoreCase": true,
"expand": true
},
{
"@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
"name": "language_word_delim_token_filter",
"generateWordParts": true,
"generateNumberParts": true,
"catenateWords": false,
"catenateNumbers": false,
"catenateAll": false,
"splitOnCaseChange": true,
"preserveOriginal": false,
"splitOnNumerics": true,
"stemEnglishPossessive": true,
"protectedWords": []
}
],
"charFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
"name": "replace_punctuation_with_comma",
"mappings": [
"#=>,",
"$=>,",
"€=>,",
"£=>,",
"%=>,",
"&=>,",
"+=>,",
"/=>,",
"==>,",
"<=>,",
">=>,",
"@=>,",
"_=>,",
"µ=>,",
"§=>,",
"¤=>,",
"°=>,",
"!=>,",
"?=>,",
"\"=>,",
"'=>,",
"`=>,",
"~=>,",
"^=>,",
".=>,",
":=>,",
";=>,",
"(=>,",
")=>,",
"[=>,",
"]=>,",
"{=>,",
"}=>,",
"*=>,",
"-=>,"
]
}
]
Analyze API call
{
"analyzer": "text_language_search_custom_analyzer",
"text": "#femmes2017"
}
Analyze API response
{
"@odata.context": "https://one-adscope-search-eu-prod.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
"tokens": [
{
"token": "femmes",
"startOffset": 1,
"endOffset": 7,
"position": 0
},
{
"token": "2017",
"startOffset": 7,
"endOffset": 11,
"position": 1
}
]
}
The input text is processed by the analyzer's components in this order: char filters -> tokenizer -> token filters. In your case, the tokenizer performs lemmatization before the WordDelimiter token filter ever sees the tokens. Unfortunately, the Microsoft stemmers and lemmatizers are not available as standalone token filters that you could apply after the WordDelimiter token filter. You would need to add another token filter that normalizes the output of the WordDelimiter token filter according to your requirements. If this is the only case you care about, you could move the SynonymTokenFilter to the end of the analyzer chain and map femmes to femme. This is obviously not a great workaround, as it is very specific to the data you are processing. Hopefully this information helps you find a more general solution.
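As a sketch of that workaround, using the analyzer and filter names from the configuration in the question: move `text_synonym_token_filter` to the end of the `tokenFilters` chain so it runs after `language_word_delim_token_filter`. The analyzer definition would look roughly like this (only the filter order changes):

```json
{
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "name": "text_language_search_custom_analyzer",
  "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
  "tokenFilters": [
    "lowercase",
    "asciifolding",
    "language_word_delim_token_filter",
    "text_synonym_token_filter"
  ],
  "charFilters": [
    "html_strip",
    "replace_punctuation_with_comma"
  ]
}
```

You would then add an equivalence rule such as `"femme,femmes"` to the filter's `synonyms` list; with `expand` set to true, the filter should emit both forms when either is seen, covering the Femmes/Femme expectation in the failing test case. Note this is a sketch of the data-specific workaround described above, not a general lemmatization solution.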