Creating a custom tokenizer in ElasticSearch NEST
I have a custom class in ES 2.5 with the following fields:
Title
DataSources
Content
Running searches works fine, except against the middle field, which is built/indexed using the delimiter "|".
ex: "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"
I need to build a query that matches some terms across all fields, with at least one number matched in the DataSources field.
To summarize where I'm at so far:
QueryBase query = new SimpleQueryStringQuery
{
//DefaultOperator = !operatorOR ? Operator.And : Operator.Or,
Fields = LearnAboutFields.FULLTEXT,
Analyzer = "standard",
Query = searchWords.ToLower()
};
_boolQuery.Must = new QueryContainer[] {query};
That's the search-term query.
foreach (var datasource in dataSources)
{
// Add DataSources with an OR
queryContainer |= new WildcardQuery { Field = LearnAboutFields.DATASOURCE, Value = string.Format("*{0}*", datasource) };
}
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] {queryContainer};
That's the data-source query; there can be multiple data sources.
It doesn't work: it returns no results once the filter query is added. I think I need to do some work on the tokenizer/analyzer side, but I don't know ES well enough to figure it out.
EDIT: Per Val's comment below, I tried recoding the indexer like this:
_elasticClientWrapper.CreateIndex(_DataSource, i => i
.Mappings(ms => ms
.Map<LearnAboutContent>(m => m
.Properties(p => p
.String(s => s.Name(lac => lac.DataSources)
.Analyzer("classic_tokenizer")
.SearchAnalyzer("standard")))))
.Settings(s => s
.Analysis(an => an.Analyzers(a => a.Custom("classic_tokenizer", ca => ca.Tokenizer("classic"))))));
var indexResponse = _elasticClientWrapper.IndexMany(contentList);
It builds successfully and the data is there, but the query still isn't working.
New query for the data sources:
foreach (var datasource in dataSources)
{
// Add DataSources with an OR
queryContainer |= new TermQuery {Field = LearnAboutFields.DATASOURCE, Value = datasource};
}
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Must = new QueryContainer[] {queryContainer};
And the index JSON:
{
  "learnabout_index": {
    "aliases": {},
    "mappings": {
      "learnaboutcontent": {
        "properties": {
          "articleID": { "type": "string" },
          "content": { "type": "string" },
          "dataSources": {
            "type": "string",
            "analyzer": "classic_tokenizer",
            "search_analyzer": "standard"
          },
          "description": { "type": "string" },
          "fileName": { "type": "string" },
          "keywords": { "type": "string" },
          "linkURL": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1483992041623",
        "analysis": {
          "analyzer": {
            "classic_tokenizer": {
              "type": "custom",
              "tokenizer": "classic"
            }
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "iZakEjBlRiGfNvaFn-yG-w",
        "version": { "created": "2040099" }
      }
    },
    "warmers": {}
  }
}
Query JSON request:
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"fields": [
"_all"
],
"query": "\"housing\"",
"analyzer": "standard"
}
}
],
"filter": [
{
"terms": {
"DataSources": [
"1"
]
}
}
]
}
}
}
At first glance at your code, one problem you may be running into is that the terms query you placed in the filter clause is not analyzed: the value is not broken into tokens, it is compared against the indexed terms as a whole.
This is easy to forget, so any value that needs to be analyzed at search time has to go through an analyzed query in a must or should clause.
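To illustrate the difference (a plain Python sketch of the matching behavior, not Elasticsearch itself): an exact, term-level comparison against the raw delimited string can never match a single number, whereas a field tokenized on "|" exposes each number as its own term.

```python
stored = "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"

# Compared as a whole, a single data-source id never equals the raw string:
print(stored == "4")  # False

# After tokenizing on '|', each id is a separate term and can match exactly:
tokens = [t for t in stored.split("|") if t]
print("4" in tokens)  # True
print("1" in tokens)  # False -- '1' is not a token, even though '10' and '12' contain it
```

This is also why the wildcard approach ("*1*") over-matches: it would hit "10", "12", "14", and "19" as substrings.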
One way to achieve this is to create a custom analyzer with a classic tokenizer, which will break your DataSources field into the numbers composing it, i.e. it will tokenize the field on each "|" character.
So when you create your index, you need to add this custom analyzer and then use it on your DataSources field:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"number_analyzer": {
"type": "custom",
"tokenizer": "number_tokenizer"
}
},
"tokenizer": {
"number_tokenizer": {
"type": "classic"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"DataSources": {
"type": "string",
"analyzer": "number_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
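Before reindexing, it's worth verifying what the analyzer actually emits. A quick check with the _analyze API (the query-string form below is what 2.x accepts; newer versions take a JSON body instead) should return each number as a separate token:

```
GET my_index/_analyze?analyzer=number_analyzer&text=|4|7|8|9|10
```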
So if you index the string "|4|7|8|9|10|12|14|19|20|21|22|23|29|30", your DataSources field will effectively contain the following array of tokens: [4, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 23, 29, 30]
Then you can get rid of your WildcardQuery and simply use a TermsQuery instead:
terms = new TermsQuery {Field = LearnAboutFields.DATASOURCE, Terms = dataSources }
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] { terms };
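For reference, a TermsQuery like the one above serializes to a plain terms query, along these lines (ids "4" and "7" are example values):

```json
{
  "terms": {
    "dataSources": [ "4", "7" ]
  }
}
```

One thing to double-check: terms queries are matched verbatim, so the field name must match the mapping exactly. The mapping defines the field as dataSources, while the earlier query JSON filters on DataSources, which by itself would prevent any matches.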