Search by Special Character using ElasticSearch
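I am using ElasticSearch in a VB.NET project. Normal search, i.e. searching by any word, works fine. Now, as per a new requirement, I also want to search by a special character, i.e. ?. I am passing ? as a regular search term, but it does not work properly.
Code: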
client.CreateIndex(Function(d) d.Analysis(Function(z) z.Analyzers(Function(a) a.Add("nGram_analyzer", Get_nGram_analyzer()).
        Add("whitespace_analyzer", Get_whitespace_analyzer()).
        Add("autocmp", New Nest.CustomAnalyzer() With {.Tokenizer = "edgeNGram", .Filter = {"lowercase"}})).
    Tokenizers(Function(t) t.Add("edgeNGram", New Nest.EdgeNGramTokenizer With {.MinGram = 1, .MaxGram = 20})).
    TokenFilters(Function(t) t.Add("nGram_filter", Get_nGram_filter()))).
    Index(Of view_Article).AddMapping(Of view_Article)(ArticleMapping)

Private Shared Function Get_nGram_filter() As NgramFilter
    Return New NgramFilter With {
        .MinGram = 1,
        .MaxGram = 20,
        .token_chars = New List(Of String) From {"letter", "digit", "punctuation", "symbol"}
    }
End Function

Private Shared Function Get_nGram_analyzer() As CustomAnalyzer
    Return New CustomAnalyzer() With {
        .Tokenizer = "whitespace",
        .Filter = New List(Of String)() From {"lowercase", "asciifolding", "nGram_filter"}
    }
End Function

Private Shared Function Get_whitespace_analyzer() As CustomAnalyzer
    Return New CustomAnalyzer() With {
        .Tokenizer = "whitespace",
        .Filter = New List(Of String)() From {"lowercase", "asciifolding"}
    }
End Function
Search query:
"query": {
  "query_string": {
    "query": "\\?",
    "fields": [
      "title"
    ],
    "default_operator": "and",
    "analyze_wildcard": true
  }
}
Note: I want to search in several ways, i.e. by keyword, by keyword + special character, or by special character only.
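I changed my answer after a discussion with @jeeten. The answer given by @Nishant also works, but it has the following functional and non-functional issues:
Functional issue:
- Only the special characters ? and / are supposed to be searchable, whereas that answer makes every punctuation character searchable.
Non-functional issues:
- It indexes the title in 3 fields with different formats. This increases the index size on disk and puts more pressure on memory, because Elasticsearch caches the inverted index for better search performance.
- Likewise, every search has to query all three fields, and searching more fields costs performance.
- Tokens are duplicated across the three fields of title.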
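My solution
To address the functional and non-functional requirements above, I use the pattern_capture token filter to index only ? and /. It is configured with "preserve_original": true so that searches like foo? are also supported.
I also index only 2 fields and search only those two fields, for better performance.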
Index definition (the patterns list can be extended for future requirements)
{
  "settings": {
    "analysis": {
      "filter": {
        "splcharfilter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([?/])"
          ]
        }
      },
      "analyzer": {
        "splcharanalyzer": {
          "tokenizer": "keyword",
          "filter": [
            "splcharfilter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "splchar": {
            "type": "text",
            "analyzer": "splcharanalyzer"
          }
        }
      }
    }
  }
}
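To get a feel for which tokens splcharanalyzer puts into title.splchar, here is a rough Python emulation of the chain (keyword tokenizer, pattern_capture with preserve_original, lowercase). It is only a sketch and not how Elasticsearch implements the filter. The function name is made up for illustration, and skipping a capture that equals the whole token is an assumption of this sketch.

```python
import re

def splchar_analyzer(text, patterns=(r"([?/])",)):
    """Rough emulation: keyword tokenizer -> pattern_capture -> lowercase."""
    # keyword tokenizer: the whole input is a single token,
    # and preserve_original keeps that token in the output.
    tokens = [text]
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            for group in match.groups():
                # Assumption: a capture identical to the original token
                # is not emitted a second time.
                if group is not None and group != text:
                    tokens.append(group)
    return [t.lower() for t in tokens]

print(splchar_analyzer("Are you ready to change the climate?"))
print(splchar_analyzer("?"))
```

The first call prints ['are you ready to change the climate?', '?'] and the second prints ['?'], so a query for ? and a title containing ? end up sharing the token ?.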
Search query (adjust the query value as needed; note that only 2 fields are searched)
{
  "query": {
    "query_string": {
      "query": "\\?",
      "fields": ["title", "title.splchar"]
    }
  }
}
Search results
"hits": [
  {
    "_index": "pattern-capture",
    "_type": "_doc",
    "_id": "2",
    "_score": 1.0341108,
    "_source": {
      "title": "Are you ready to change the climate?"
    }
  },
  {
    "_index": "pattern-capture",
    "_type": "_doc",
    "_id": "4",
    "_score": 1.0341108,
    "_source": {
      "title": "What are the effects of direct public transfers on social solidarity?"
    }
  }
]
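P.S. To keep the answer short I have not listed every search query and its output, but anyone can index the documents, vary the search query, and confirm that it works as expected.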
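When the query value is built from user input, the special character has to be escaped twice: once with a backslash for query_string, and once more when the body is serialized to JSON. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the escaped character set follows the reserved characters listed in the Elasticsearch query_string documentation as I understand them, so check it against your version.

```python
import json

# Characters that query_string would otherwise read as operators.
# Assumption: this set is complete enough for your Elasticsearch version.
RESERVED = set('\\+-=!(){}[]^"~*?:/&|')

def escape_query_string(text):
    """Backslash-escape reserved query_string characters."""
    # "<" and ">" cannot be escaped in query_string, so this sketch drops them.
    text = text.replace("<", "").replace(">", "")
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

def build_query(user_input, fields):
    """Hypothetical helper: build the search body as a Python dict."""
    return {
        "query": {
            "query_string": {
                "query": escape_query_string(user_input),
                "fields": fields,
            }
        }
    }

body = build_query("climate?", ["title", "title.splchar"])
print(json.dumps(body))
```

The printed body contains "query": "climate\\?", i.e. a single backslash for query_string, doubled by the JSON encoder.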
Taking the following example from the chat as the basis:
Some example titles:
title: Climate: The case of Nigerian agriculture
title: Are you ready to change the climate?
title: A literature review with a particular focus on the school staff
title: What are the effects of direct public transfers on social solidarity?
title: Community-Led Practical and/or Social Support Interventions for Adults Living at Home.
If I search by only "?" then it should return the 2nd and 4th results.
If I search by "/" then it should return only last record.
Search by climate then 1st and 2nd results.
Search by climate? then 1st, 2nd, and 4th results.
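The solution requires creating analyzers for the following cases:
1. Searching for a special character. I treat these as punctuation, e.g. /, ? etc.
2. Searching for a keyword together with a special character, e.g. climate?
3. Searching for a keyword, e.g. climate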
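For case 1 we use the pattern tokenizer, but instead of using the pattern to split the text we use it to extract the special characters as tokens. To do that we set "group": 0 when defining the tokenizer. For example, for the text xyz a/b pq? the generated tokens are / and ?.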
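The difference between splitting on a pattern and extracting its matches can be sketched in Python. This is only an illustration of the idea behind "group": 0 and does not call Elasticsearch. It assumes that Java's \p{Punct} covers the same ASCII punctuation characters as Python's string.punctuation; the function name is made up.

```python
import re
import string

# Character class with the ASCII punctuation characters.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def splchar_tokens(text):
    """Emit every punctuation match as a token (the "group": 0 behaviour)."""
    return PUNCT.findall(text)

text = "xyz a/b pq?"
print(splchar_tokens(text))  # matches become the tokens
print(PUNCT.split(text))     # default behaviour: the pattern is a separator
```

The first line prints ['/', '?']. The second prints ['xyz a', 'b pq', ''], which is what splitting on the same pattern would give instead.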
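For case 2 we create a custom analyzer with lowercase as the filter (for case-insensitive matching) and whitespace as the tokenizer (to keep special characters attached to the keyword). For example, for the text How many? the generated tokens are how and many?.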
For case 3 we use the standard analyzer, which is the default analyzer.
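The next step is to create sub-fields for title. title is of type text and uses the standard analyzer by default. This mapping property gets two sub-fields: withSplChar, of type text with the analyzer created for case 2 (ci_whitespace), and splChars, of type text with the analyzer created for case 1 (splchar).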
Now let's see the above in action:
PUT test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "splchar": {
          "type": "pattern",
          "pattern": "\\p{Punct}",
          "group": 0
        }
      },
      "analyzer": {
        "splchar": {
          "tokenizer": "splchar"
        },
        "ci_whitespace": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "withSplChar": {
            "type": "text",
            "analyzer": "ci_whitespace"
          },
          "splChars": {
            "type": "text",
            "analyzer": "splchar"
          }
        }
      }
    }
  }
}
Now let's index the documents from the example above:
POST test/_bulk
{"index":{"_id":"1"}}
{"title":"Climate: The case of Nigerian agriculture"}
{"index":{"_id":"2"}}
{"title":"Are you ready to change the climate?"}
{"index":{"_id":"3"}}
{"title":"A literature review with a particular focus on the school staff"}
{"index":{"_id":"4"}}
{"title":"What are the effects of direct public transfers on social solidarity?"}
{"index":{"_id":"5"}}
{"title":"Community-Led Practical and/or Social Support Interventions for Adults Living at Home."}
Search for ?:
POST test/_search
{
  "query": {
    "query_string": {
      "query": "\\?",
      "fields": ["title", "title.withSplChar", "title.splChars"]
    }
  }
}
Result:
"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 0.8025915,
    "_source" : {
      "title" : "Are you ready to change the climate?"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 0.8025915,
    "_source" : {
      "title" : "What are the effects of direct public transfers on social solidarity?"
    }
  }
]
Search for climate:
POST test/_search
{
  "query": {
    "query_string": {
      "query": "climate",
      "fields": ["title", "title.withSplChar", "title.splChars"]
    }
  }
}
Result:
"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.0341107,
    "_source" : {
      "title" : "Climate: The case of Nigerian agriculture"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 0.98455274,
    "_source" : {
      "title" : "Are you ready to change the climate?"
    }
  }
]
Search for climate?:
POST test/_search
{
  "query": {
    "query_string": {
      "query": "climate\\?",
      "fields": ["title", "title.withSplChar", "title.splChars"]
    }
  }
}
Result:
"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 1.5366155,
    "_source" : {
      "title" : "Are you ready to change the climate?"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.0341107,
    "_source" : {
      "title" : "Climate: The case of Nigerian agriculture"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 0.8025915,
    "_source" : {
      "title" : "What are the effects of direct public transfers on social solidarity?"
    }
  }
]
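To see why each query returns the documents it does, the three-field setup can be approximated offline in Python. This is a simplified sketch, not a replacement for testing against Elasticsearch: the standard analyzer is reduced to "lowercase word characters", scoring is ignored, and a document counts as a hit if it shares at least one token with the query in any field. The function names are made up for illustration.

```python
import re
import string

TITLES = {
    1: "Climate: The case of Nigerian agriculture",
    2: "Are you ready to change the climate?",
    3: "A literature review with a particular focus on the school staff",
    4: "What are the effects of direct public transfers on social solidarity?",
    5: "Community-Led Practical and/or Social Support Interventions for Adults Living at Home.",
}

PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def standard(text):
    # Crude stand-in for the standard analyzer (field: title).
    return re.findall(r"\w+", text.lower())

def ci_whitespace(text):
    # whitespace tokenizer + lowercase filter (field: title.withSplChar).
    return text.lower().split()

def splchar(text):
    # Punctuation characters as tokens (field: title.splChars).
    return PUNCT.findall(text)

ANALYZERS = [standard, ci_whitespace, splchar]

def search(query):
    """Ids of titles sharing at least one token with the query in any field."""
    hits = set()
    for analyze in ANALYZERS:
        query_tokens = set(analyze(query))
        for doc_id, title in TITLES.items():
            if query_tokens & set(analyze(title)):
                hits.add(doc_id)
    return sorted(hits)

for q in ["?", "/", "climate", "climate?"]:
    print(repr(q), search(q))
```

This prints [2, 4] for ?, [5] for /, [1, 2] for climate and [1, 2, 4] for climate?, which lines up with the expectations taken from the chat: climate? matches document 1 through the standard field, document 2 through all three fields, and document 4 only through the ? token in title.splChars.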