Combine fields in Elasticsearch

Suppose I have a table in SQL where I combine two fields into one

A    |     B
-----|------------     => select A + ' ' + B as Name  => e.g. "BMW 3-Series"
BMW  |  3-Series
BMW  |  X3

I dump it into a temp table and then do a wildcard search on the temp table, which returns the results along with counts

select Name,count(Name) as frequency from Table where Name like '%3%' group by Name

    Name         | Frequency
    -------------|----------
    BMW 3-Series |     1
    BMW X3       |     1

Given that A and B are separate fields, how can I now achieve the same in Elasticsearch?

I have tried this:

{ "query":{
      "query_string":{
          "fields":["A","B"],
          "query":"3"
      }
      }, "aggs": {
    "count": {
      "terms": {
        "field": "A"
      },
      "aggs": {
        "count": {
          "terms": {
            "field": "B"
          }

        }
      }
    }
  }

}

How do I add a regular expression to the query?

One of the major differences between SQL and Elasticsearch is that, by default, string fields are analyzed at index time, and you can control how they are analyzed with Analyzers.

The default analyzer, the Standard Analyzer, will produce tokens from the input and store these in an inverted index. You can see what tokens would be generated for a given input by using the Analyze API:

curl -XPOST "http://localhost:9200/_analyze?analyzer=standard" -d'
{
  "text" : "3-Series"
}'

which produces the output

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Knowing this, and using a search query that analyzes its input at search time, such as the Query String Query, there is no need for regular expression or wildcard queries if you analyze the input in a way that supports your use case.
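
For example, because the query text is analyzed with the same analyzer as the field, searching for 3 will match a document containing 3-Series without any wildcards. A minimal sketch (the cars index and the analyzed model field are placeholders here, not yet defined at this point):

# sketch only: index and field names are placeholders
curl -XPOST "http://localhost:9200/cars/_search" -d'
{
  "query": {
    "query_string": {
      "fields": ["model"],
      "query": "3"
    }
  }
}'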

You could decide to index "BMW 3-Series" in a single field and analyze it in different ways using multi_fields, or keep the values in separate fields as you have and search across both fields.
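
For the multi_fields approach, the mapping could look something like the following. This is a minimal sketch for illustration only; the index name, the car type and the raw sub-field name are assumptions, not part of the example that follows. The not_analyzed sub-field keeps the original value intact alongside the analyzed one:

# illustrative sketch only: "raw" is an arbitrary sub-field name
curl -XPUT "http://localhost:9200/cars-multifield" -d'
{
  "mappings": {
    "car": {
      "properties": {
        "model": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}'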

Here's an example to get you started. Given that we have the following POCO

public class Car
{
    public string Make { get; set; }
    public string Model { get; set; }
}

we can set up the index as follows

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var carsIndex = "cars";
var connectionSettings = new ConnectionSettings(pool)
        .DefaultIndex(carsIndex);

var client = new ElasticClient(connectionSettings);

client.CreateIndex(carsIndex, ci => ci
    .Settings(s => s
        .Analysis(analysis => analysis
            .Tokenizers(tokenizers => tokenizers
                // tokenize on any run of non-word characters
                .Pattern("model-tokenizer", p => p.Pattern(@"\W+"))
            )
            .TokenFilters(tokenfilters => tokenfilters
                // split tokens on letter/number boundaries, keeping the original
                // token as well as the number and word parts
                .WordDelimiter("model-words", wd => wd
                    .PreserveOriginal()
                    .SplitOnNumerics()
                    .GenerateNumberParts()
                    .GenerateWordParts()
                )
            )
            .Analyzers(analyzers => analyzers
                // combine the tokenizer and token filters, lowercasing all tokens at the end
                .Custom("model-analyzer", c => c
                    .Tokenizer("model-tokenizer")
                    .Filters("model-words", "lowercase")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Car>(mm => mm
            .AutoMap()
            .Properties(p => p
                // analyze the Model field with the custom analyzer
                .String(s => s
                    .Name(n => n.Model)
                    .Analyzer("model-analyzer")
                )
            )
        )
    )
);

This creates a cars index and defines a custom analyzer to be used for the Model field. The custom analyzer splits the input into tokens on any non-word characters, then uses a word delimiter token filter to split each token on numeric characters, producing a token that preserves the original, a token for the number part and a token for the word part. Finally, all tokens are lowercased.

We can test what model-analyzer will do with our input, to see if it suits our needs

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text" : "X3"
}'

which produces

{
  "tokens": [
    {
      "token": "x3",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text" : "3-Series"
}'

which produces

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    }
  ]
}

This looks well suited to the problem at hand. Now, if we index some documents and run a search, we should get the results we're looking for

client.Index<Car>(new Car { Make = "BMW", Model = "3-Series" });
client.Index<Car>(new Car { Make = "BMW", Model = "X3" });

// refresh the index so that documents are available to search
client.Refresh(carsIndex);

client.Search<Car>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Make)
                .Field(c => c.Model)
            )
            .Query("3")
        )
    )
);

which yields the following results

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.058849156,
    "hits" : [ {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhENDDGlNKQ4qnluJ",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "3-Series"
      }
    }, {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhEOXDGlNKQ4qnluK",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "X3"
      }
    } ]
  }
}
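
To also get counts per value, similar to the SQL GROUP BY, a terms aggregation could be added to the search request. A minimal sketch, assuming a not_analyzed model.raw sub-field as in the multi_fields sketch above (the aggregation names makes and models are arbitrary); note that a terms aggregation on an analyzed field such as make counts individual tokens (here bmw) rather than the original value:

# sketch: assumes a not_analyzed "model.raw" multi-field exists in the mapping
curl -XPOST "http://localhost:9200/cars/car/_search" -d'
{
  "query": {
    "query_string": {
      "fields": ["make", "model"],
      "query": "3"
    }
  },
  "aggs": {
    "makes": {
      "terms": { "field": "make" },
      "aggs": {
        "models": {
          "terms": { "field": "model.raw" }
        }
      }
    }
  }
}'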

Hope this gives you some ideas.