ElasticSearch search getting bad results
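I'm new to ElasticSearch and am having trouble getting search results that I consider good. My objective is to search an index of medications (6 fields) based on a phrase the user enters, which may be one or more words. I've tried a few approaches, but I'll outline the best one I've found so far below. Let me know what I'm doing wrong. I'm guessing I'm missing something fundamental.
Here is a subset of the fields I'm working with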
...
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "17471",
"_score": 8.829264,
"_source": {
"SearchContents": " chew chewable oral po tylenol",
"MedShortDesc": "Tylenol PO Chew",
"MedLongDesc": "Tylenol Oral Chewable",
"GenericDesc": "ACETAMINOPHEN ORAL",
...
}
}
...
The fields I'm searching use an Edge NGram analyzer. I'm using the C# Nest library for indexing
settings.Analysis.Tokenizers.Add("edgeNGram", new EdgeNGramTokenizer()
{
MaxGram = 50,
MinGram = 2,
TokenChars = new List<string>() { "letter", "digit" }
});
settings.Analysis.Analyzers.Add("edgeNGramAnalyzer", new CustomAnalyzer()
{
Filter = new string[] { "lowercase" },
Tokenizer = "edgeNGram"
});
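To make the effect of this analyzer concrete, here is a rough Python approximation of what it emits for one of the field values above. This is only a sketch, not ElasticSearch's actual implementation: the function name edge_ngram_tokens is made up for illustration, and it assumes token_chars of letter and digit (as in the C# snippet) restricted to ASCII.

```python
import re

def edge_ngram_tokens(text, min_gram=2, max_gram=50):
    """Approximate an edgeNGram tokenizer followed by a lowercase filter.

    Assumption: token_chars = letter, digit, so any other character
    splits the text into separate words (ASCII only in this sketch).
    """
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()
        # Emit every prefix of the word from min_gram up to max_gram characters.
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens

print(edge_ngram_tokens("Tylenol PO Chew"))
# ['ty', 'tyl', 'tyle', 'tylen', 'tyleno', 'tylenol', 'po', 'ch', 'che', 'chew']
```

Every word is indexed under all of its prefixes of two or more characters, which is what allows partial input such as "tyl" to match.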
I'm using a more_like_this query on the relevant fields
GET indexus2/Medication/_search
{
"query": {
"more_like_this" : {
"fields" : ["MedShortDesc",
"MedLongDesc",
"GenericDesc",
"SearchContents"],
"like_text" : "vicodin",
"min_term_freq" : 1,
"max_query_terms" : 25,
"min_word_len": 2
}
}
}
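The problem is that for a search on 'vicodin', I would expect matches on the full word to come first, but they don't. Here is a subset of the results from this query. Vicodin doesn't show up until the 7th result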
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "31192",
"_score": 4.567309,
"_source": {
"SearchContents": " oral po victrelis",
"MedShortDesc": "Victrelis PO",
"MedLongDesc": "Victrelis Oral",
"RepresentativeRoutedGenericDesc": "BOCEPREVIR ORAL",
...
}
}
<5 more similar results>
{
"_index": "indexus2",
"_type": "Medication",
"_id": "26198",
"_score": 2.2836545,
"_source": {
"SearchContents": " (original 5 500 feeding mg strength) tube via vicodin",
"MedShortDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"MedLongDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"GenericDesc": "HYDROCODONE BITARTRATE/ACETAMINOPHEN ORAL",
...
}
}
Field mappings
"OrderableMedLongDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"OrderableMedShortDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"RepresentativeRoutedGenericDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"SearchContents": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
Here is what ES shows for my analyzer settings
"analyzer": {
"edgeNGramAnalyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "edgeNGram"
}
},
"tokenizer": {
"edgeNGram": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "50"
}
}
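Based on the mapping above, edgeNGramAnalyzer is also the search analyzer for the fields, so the search query gets "edge ngrammed" as well: a query for vicodin is split into vi, vic, vico and so on, and any document containing a word that starts with "vi" (such as Victrelis) matches. You probably don't want this.
Change the mapping so that only the index_analyzer option is set to edgeNGramAnalyzer. The search_analyzer will then default to standard.
Example: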
"SearchContents": {
"type": "string",
"index_analyzer": "edgeNGramAnalyzer"
},
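The difference between the two search analyzers can be sketched in a few lines of Python. This is an illustration under assumptions, not how ElasticSearch computes matches or scores: the helper names are hypothetical, the edge n-gram step assumes token_chars of letter and digit, and the standard analyzer is approximated as lowercasing plus splitting on non-alphanumeric characters.

```python
import re

def edge_ngrams(text, min_gram=2, max_gram=50):
    """Set of lowercase word prefixes, approximating edgeNGramAnalyzer."""
    terms = set()
    for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
        for n in range(min_gram, min(len(word), max_gram) + 1):
            terms.add(word[:n])
    return terms

def standard_terms(text):
    """Rough stand-in for the standard analyzer: lowercase whole words."""
    return set(re.findall(r"[A-Za-z0-9]+", text.lower()))

# Terms stored in the index for two of the documents from the question.
victrelis_index = edge_ngrams("Victrelis PO")
vicodin_index = edge_ngrams(
    "Vicodin 5 mg-500 mg (Original Strength) via feeding tube")

query = "vicodin"

# Current mapping: the query is edge n-grammed too, so short prefixes match.
ngrammed_query = edge_ngrams(query)
print(sorted(ngrammed_query & victrelis_index))  # ['vi', 'vic']

# Suggested mapping: the query stays a whole word at search time.
whole_word_query = standard_terms(query)
print(sorted(whole_word_query & victrelis_index))  # []
print(sorted(whole_word_query & vicodin_index))    # ['vicodin']
```

With the query left as a single term, Victrelis no longer shares any term with it, while the Vicodin document still matches through its indexed "vicodin" prefix. The sketch only shows which documents share terms with the query; it does not model relevance scoring.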