Fuzzy matching of 2 words

This query:

 {
      ""query"": {
        ""match"": {
          ""attachment.content"": {
              ""query"": ""hello world"",
              ""minimum_should_match"": 2,
              ""fuzziness"": 1
          }
        }
      }
    }

is meant to return items that contain:

hello world
hello Vorld
pello world

In other words, at most one character may differ. However, it also seems to return items that contain only:

hello

Why does that happen, given that minimum_should_match = 2 is specified, i.e. AND is imposed?

PS:

The relevant part of the mapping:

{
  "my_great_index" : {
    "mappings" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "author" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "containsMetadata" : {
              "type" : "boolean"
            },
            "content" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "date" : {
              "type" : "date"
            },
            "detect_language" : {
              "type" : "boolean"
            },
            "indexed_chars" : {
              "type" : "long"
            },
            "keywords" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "language" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "title" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "something_else" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
        ....

PPS:

This is how I create the index in C#:

https://www.elastic.co/blog/the-future-of-attachments-for-elasticsearch-and-dotnet

public static void CreateIndex(ElasticClient client, string indexName)
{
    var createIndexResponse = client.Indices.Create(indexName, c => c
    .Settings(s => s
        .Analysis(a => a
        .Analyzers(ad => ad
            .Custom("windows_path_hierarchy_analyzer", ca => ca
            .Tokenizer("windows_path_hierarchy_tokenizer")
            )
        )
        .Tokenizers(t => t
            .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
            .Delimiter('\\')
            )
        )
        )
    )
    .Map<MyItem>(mp => mp
        .AutoMap()
        .Properties(ps => ps
        .Text(s => s
            .Name(n => n.Id)
            .Analyzer("windows_path_hierarchy_analyzer")
        )
        .Object<Attachment>(a => a
            .Name(n => n.Attachment)
            .AutoMap()
        )
        )
    )
    );

    var putPipelineResponse = client.Ingest.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pr => pr
        .Attachment<MyItem>(a => a
        .Field(f => f.Content)
        .TargetField(f => f.Attachment)
        )
        .Remove<MyItem>(r => r
        .Field(ff => ff
            .Field(f => f.Content)
        )
        )
    )
    );
}

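I just tried your example on Elasticsearch 7.6 and it works for me. Can you share how you index your data, i.e. some sample documents, and your Elasticsearch version?

Also, the query you provided is not syntactically valid.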
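
One quick way to see the syntax problem is to feed the query to a JSON parser. The sketch below assumes the query was posted exactly as it appears in the question, with every quote doubled (as it would be inside a C# verbatim string). It uses only Python's standard `json` module; the names `posted`, `fixed` and `is_valid_json` are illustrative.

```python
import json

# The query as shown in the question: every quote is doubled.
posted = '''
{
  ""query"": {
    ""match"": {
      ""attachment.content"": {
        ""query"": ""hello world"",
        ""minimum_should_match"": 2,
        ""fuzziness"": 1
      }
    }
  }
}
'''


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Collapsing each doubled quote into a single one yields a parseable body.
fixed = posted.replace('""', '"')

print(is_valid_json(posted))  # False
print(is_valid_json(fixed))   # True
```

If the doubled quotes only exist in the C# source and the string sent over the wire is already unescaped, this check would pass and the cause lies elsewhere.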

Index definition with fewer fields

{
    "mappings": {
        "properties": {
            "attachment": {
                "properties": {
                    "author": {
                        "type": "text"
                    },
                    "content": {
                        "type": "text"
                    }
                }
            }
        }
    }
}

Indexed the 3 documents you expect to match

{
    "attachment.author": "bar",
    "attachment.content": "pello world"
}

{
    "attachment.author": "bar",
    "attachment.content": "hello world"
}

{
    "attachment.author": "bar",
    "attachment.content": "hello vorld"
}
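
For reproducing this, the three documents could be sent in a single request to the `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one source line per document. The following is only a sketch of building that body. The `_id` values and the `foo` author on the first document are assumptions taken from the search hits in this answer, and `bulk_body` is a hypothetical helper name.

```python
import json

# Documents to index; ids and author values follow the search hits (assumption).
docs = {
    "1": {"attachment.author": "foo", "attachment.content": "hello world"},
    "2": {"attachment.author": "bar", "attachment.content": "hello vorld"},
    "3": {"attachment.author": "bar", "attachment.content": "pello world"},
}


def bulk_body(documents):
    """Build an NDJSON body: an 'index' action line, then the document source."""
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(docs)
print(body)
```

The resulting string would be posted to `/<index>/_bulk` with the `Content-Type: application/x-ndjson` header.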

The same search query you provided, with the syntax corrected

{
    "query": {
        "match" : {
            "attachment.content" : {
                "query" : "hello world", --> properly closed quotes
                "minimum_should_match": 2,
                "fuzziness": 1
            }
        }
    }
}

Search results

 "hits": [
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.9400072,
                "_source": {
                    "attachment.author": "foo",
                    "attachment.content": "hello world"
                }
            },
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.8460065,
                "_source": {
                    "attachment.author": "bar",
                    "attachment.content": "hello vorld"
                }
            },
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "3",
                "_score": 0.8460065,
                "_source": {
                    "attachment.author": "bar",
                    "attachment.content": "pello world"
                }
            }
        ]

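The other part of your question, i.e. documents containing only hello showing up in the search results despite minimum_should_match=2, also works fine for me. I indexed one more document: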

{
    "attachment.author": "bar",
    "attachment.content": "my world" --> only world
}

The same search query still returns only the previous 3 documents, but if we change minimum_should_match to 1, it returns all 4 documents.

 "hits": [
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0498221,
                "_source": {
                    "attachment.author": "foo",
                    "attachment.content": "hello world"
                }
            },
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.9784871,
                "_source": {
                    "attachment.author": "bar",
                    "attachment.content": "hello vorld"
                }
            },
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "3",
                "_score": 0.91119266,
                "_source": {
                    "attachment.author": "bar",
                    "attachment.content": "pello world"
                }
            },
            {
                "_index": "fuzzy",
                "_type": "_doc",
                "_id": "4",
                "_score": 0.35667494,
                "_source": {
                    "attachment.author": "bar",
                    "attachment.content": "my world" --> note last 4 doc
                }
            }
        ]
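
The behaviour above can be mimicked with a small simulation. This is a simplified model, not how Elasticsearch is implemented: it splits on whitespace and lowercases instead of running the standard analyzer, uses plain Levenshtein distance (Elasticsearch by default also counts a transposition of two adjacent characters as one edit), and ignores scoring. The function names are illustrative.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def matches(content, query, minimum_should_match, fuzziness):
    """True if enough query terms have a close-enough term in content."""
    doc_terms = content.lower().split()
    matched_terms = sum(
        any(edit_distance(q, d) <= fuzziness for d in doc_terms)
        for q in query.lower().split()
    )
    return matched_terms >= minimum_should_match


docs = ["hello world", "hello vorld", "pello world", "my world"]

# Run the same query with minimum_should_match set to 2 and then to 1.
results = {
    msm: [d for d in docs if matches(d, "hello world", msm, 1)]
    for msm in (2, 1)
}

print(results[2])  # ['hello world', 'hello vorld', 'pello world']
print(results[1])  # ['hello world', 'hello vorld', 'pello world', 'my world']
```

Under this model a document containing only hello matches one of the two terms, so it is excluded when minimum_should_match is 2. If such a document still comes back on your cluster, the request actually being sent likely differs from the one shown.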