Fuzzy matching of 2 words
This:
{
""query"": {
""match"": {
""attachment.content"": {
""query"": ""hello world"",
""minimum_should_match"": 2,
""fuzziness"": 1
}
}
}
}
is meant to return items containing:
hello world
hello Vorld
pello world
In other words, at most one character may differ. But it also seems to return items containing only:
hello
Why is that, given that minimum_should_match = 2 is specified, i.e. imposing AND?
PS:
The relevant part of the mapping:
{
"my_great_index" : {
"mappings" : {
"properties" : {
"attachment" : {
"properties" : {
"author" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"containsMetadata" : {
"type" : "boolean"
},
"content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"content_length" : {
"type" : "long"
},
"content_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"date" : {
"type" : "date"
},
"detect_language" : {
"type" : "boolean"
},
"indexed_chars" : {
"type" : "long"
},
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"language" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"something_else" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
....
PPS:
This is how I create the index in C#:
https://www.elastic.co/blog/the-future-of-attachments-for-elasticsearch-and-dotnet
public static void CreateIndex(ElasticClient client, string indexName)
{
var createIndexResponse = client.Indices.Create(indexName, c => c
.Settings(s => s
.Analysis(a => a
.Analyzers(ad => ad
.Custom("windows_path_hierarchy_analyzer", ca => ca
.Tokenizer("windows_path_hierarchy_tokenizer")
)
)
.Tokenizers(t => t
.PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
.Delimiter('\\')
)
)
)
)
.Map<MyItem>(mp => mp
.AutoMap()
.Properties(ps => ps
.Text(s => s
.Name(n => n.Id)
.Analyzer("windows_path_hierarchy_analyzer")
)
.Object<Attachment>(a => a
.Name(n => n.Attachment)
.AutoMap()
)
)
)
);
var putPipelineResponse = client.Ingest.PutPipeline("attachments", p => p
.Description("Document attachment pipeline")
.Processors(pr => pr
.Attachment<MyItem>(a => a
.Field(f => f.Content)
.TargetField(f => f.Attachment)
)
.Remove<MyItem>(r => r
.Field(ff => ff
.Field(f => f.Content)
)
)
)
);
}
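I just tried your example on Elasticsearch 7.6 and it works for me. Can you share how you index your data, i.e. sample documents, and your Elasticsearch version?
Also, the query you provided is not syntactically correct.
Index definition with fewer fields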
{
"mappings": {
"properties": {
"attachment": {
"properties": {
"author": {
"type": "text"
},
"content": {
"type": "text"
}
}
}
}
}
}
Indexed the 3 documents you expect to match
{
"attachment.author": "bar",
"attachment.content": "pello world"
}
{
"attachment.author": "bar",
"attachment.content": "hello world"
}
{
"attachment.author": "bar",
"attachment.content": "hello vorld"
}
The same search query you provided, with correct syntax
{
"query": {
"match" : {
"attachment.content" : {
"query" : "hello world", --> properly closed quotes
"minimum_should_match": 2,
"fuzziness": 1
}
}
}
}
Search results
"hits": [
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "1",
"_score": 0.9400072,
"_source": {
"attachment.author": "foo",
"attachment.content": "hello world"
}
},
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "2",
"_score": 0.8460065,
"_source": {
"attachment.author": "bar",
"attachment.content": "hello vorld"
}
},
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "3",
"_score": 0.8460065,
"_source": {
"attachment.author": "bar",
"attachment.content": "pello world"
}
}
]
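The other part of your question, i.e. documents containing only hello showing up in the search results despite minimum_should_match=2, also works fine. I indexed another document: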
{
"attachment.author": "bar",
"attachment.content": "my world" --> only world
}
The same search query still returns only the previous 3 documents, but if we change minimum_should_match to 1, it returns all 4 documents.
"hits": [
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "1",
"_score": 1.0498221,
"_source": {
"attachment.author": "foo",
"attachment.content": "hello world"
}
},
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "2",
"_score": 0.9784871,
"_source": {
"attachment.author": "bar",
"attachment.content": "hello vorld"
}
},
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "3",
"_score": 0.91119266,
"_source": {
"attachment.author": "bar",
"attachment.content": "pello world"
}
},
{
"_index": "fuzzy",
"_type": "_doc",
"_id": "4",
"_score": 0.35667494,
"_source": {
"attachment.author": "bar",
"attachment.content": "my world" --> note last 4 doc
}
}
]
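To make the interaction between fuzziness and minimum_should_match easier to reason about, here is a rough Python model of what the results above suggest. It is only a sketch under simplifying assumptions, not how Elasticsearch is implemented: terms are split on whitespace and lowercased instead of going through a real analyzer, plain Levenshtein distance is used (the real match query also counts a transposition as one edit by default), and the helper names edit_distance and matches are made up for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def matches(doc: str, query: str, fuzziness: int, minimum_should_match: int) -> bool:
    """Hypothetical model: each query term is an optional clause that is
    satisfied when some document term is within `fuzziness` edits of it;
    the document is a hit when enough clauses are satisfied."""
    doc_terms = doc.lower().split()
    query_terms = query.lower().split()
    satisfied = sum(
        1 for q in query_terms
        if any(edit_distance(q, d) <= fuzziness for d in doc_terms)
    )
    return satisfied >= minimum_should_match


docs = ["hello world", "hello vorld", "pello world", "my world"]
for msm in (2, 1):
    hits = [d for d in docs if matches(d, "hello world", 1, msm)]
    print(f"minimum_should_match={msm}: {hits}")
```

In this model "my world" satisfies only the world clause, so it is dropped when minimum_should_match is 2 and kept when it is 1, which is the same pattern as in the two result sets above. A document containing only hello behaves the same way, so if such a document does come back with minimum_should_match=2, the indexed content or the analyzer is worth checking.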