Elasticsearch: disable term frequency scoring
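I want to change the scoring in Elasticsearch so that multiple occurrences of a term are not counted. For example, I want:
"texas texas texas"
and
"texas"
to come out with the same score. I found this mapping, which Elasticsearch says disables term frequency counting, but my searches do not score them the same: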
{
  "mappings": {
    "business": {
      "properties": {
        "name": {
          "type": "string",
          "index_options": "docs",
          "norms": { "enabled": false }
        }
      }
    }
  }
}
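Any help would be greatly appreciated; I haven't been able to find much information on this.
I'm adding my search code and what is returned when I use explain.
My search code: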
// Connect to the local cluster over the transport port.
Settings settings = ImmutableSettings.settingsBuilder().put("cluster.name", "escluster").build();
Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
// Match "Texas" against the name field, using DFS so idf is computed across shards.
SearchRequest request = Requests.searchRequest("businesses")
        .source(SearchSourceBuilder.searchSource().query(QueryBuilders.boolQuery()
                .should(QueryBuilders.matchQuery("name", "Texas")
                        .minimumShouldMatch("1"))))
        .searchType(SearchType.DFS_QUERY_THEN_FETCH);
ExplainRequest request2 = client.prepareIndex("businesses", "business")
When I search with explain, I get:
"took" : 14,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_shard" : 1,
"_node" : "BTqBPVDET5Kr83r-CYPqfA",
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9U5KBks4zEorv9YI4n",
"_score" : 1.0,
"_source":{
"name" : "texas"
}
,
"_explanation" : {
"value" : 1.0,
"description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 1.0,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 1.0,
"description" : "idf(docFreq=2, maxDocs=3)"
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}
}, {
"_shard" : 1,
"_node" : "BTqBPVDET5Kr83r-CYPqfA",
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9U5K6Ks4zEorv9YI4o",
"_score" : 0.8660254,
"_source":{
"name" : "texas texas texas"
}
,
"_explanation" : {
"value" : 0.8660254,
"description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.8660254,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.7320508,
"description" : "tf(freq=3.0), with freq of:",
"details" : [ {
"value" : 3.0,
"description" : "termFreq=3.0"
} ]
}, {
"value" : 1.0,
"description" : "idf(docFreq=2, maxDocs=3)"
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}
} ]
}
It looks like it is still taking term frequency and document frequency into account. Any ideas? Sorry about the formatting; I don't know why it looks so odd.
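The numbers in the explain output can be reproduced by hand, which makes it easier to see which factors are still active. Below is a minimal sketch, assuming the field is scored with Lucene's classic TF-IDF similarity (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1))). The class and method names are made up for illustration, and fieldNorm is simply taken from the explain output rather than recomputed.

```java
public class ClassicScoreCheck {
    // tf under the classic TF-IDF similarity: square root of the term frequency.
    static double tf(double termFreq) {
        return Math.sqrt(termFreq);
    }

    // idf under the classic TF-IDF similarity: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldWeight is the product of the three factors listed by explain.
    // fieldNorm is passed in as reported by explain (assumption: not recomputed here).
    static double fieldWeight(double termFreq, long docFreq, long maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // "texas": termFreq=1, docFreq=2, maxDocs=3, fieldNorm=1.0
        System.out.println(fieldWeight(1.0, 2, 3, 1.0));
        // "texas texas texas": termFreq=3, docFreq=2, maxDocs=3, fieldNorm=0.5
        System.out.println(fieldWeight(3.0, 2, 3, 0.5));
    }
}
```

With the values from the two hits, this gives 1.0 for "texas" and roughly 0.866 for "texas texas texas", matching the reported scores. A tf factor other than 1 and a fieldNorm other than 1 both suggest that the index is not actually using the docs / norms-disabled mapping for this field.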
The result of a browser search at http://localhost:9200/businesses/business/_search?pretty=true&qname=texas
is:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 1.0,
"hits" : [ {
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9YcCKjKvtg8NgyozGK",
"_score" : 1.0,
"_source":{"business" : {
"name" : "texas texas texas texas" }
}
}, {
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9YateBKvtg8Ngyoy-p",
"_score" : 1.0,
"_source":{
"name" : "texas" }
}, {
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9YavVnKvtg8Ngyoy-4",
"_score" : 1.0,
"_source":{
"name" : "texas texas texas" }
}, {
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9Yb7NgKvtg8NgyozFf",
"_score" : 1.0,
"_source":{"business" : {
"name" : "texas texas texas" }
}
} ]
}
}
It finds all 4 documents I have in the index, and they all have the same score.
When I run the search through the Java API with explain, I get:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.287682,
"hits" : [ {
"_shard" : 1,
"_node" : "BTqBPVDET5Kr83r-CYPqfA",
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9YateBKvtg8Ngyoy-p",
"_score" : 1.287682,
"_source":{
"name" : "texas" }
,
"_explanation" : {
"value" : 1.287682,
"description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 1.287682,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 1.287682,
"description" : "idf(docFreq=2, maxDocs=4)"
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}
}, {
"_shard" : 1,
"_node" : "BTqBPVDET5Kr83r-CYPqfA",
"_index" : "businesses",
"_type" : "business",
"_id" : "AU9YavVnKvtg8Ngyoy-4",
"_score" : 1.1151654,
"_source":{
"name" : "texas texas texas" }
,
"_explanation" : {
"value" : 1.1151654,
"description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 1.1151654,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.7320508,
"description" : "tf(freq=3.0), with freq of:",
"details" : [ {
"value" : 3.0,
"description" : "termFreq=3.0"
} ]
}, {
"value" : 1.287682,
"description" : "idf(docFreq=2, maxDocs=4)"
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}
} ]
}
}
It seems that the index_options of a field cannot be overridden once the field has been initially set in the mapping.
Example:
put test
put test/business/_mapping
{
  "properties": {
    "name": {
      "type": "string",
      "index_options": "freqs",
      "norms": {
        "enabled": false
      }
    }
  }
}
put test/business/_mapping
{
  "properties": {
    "name": {
      "type": "string",
      "index_options": "docs",
      "norms": {
        "enabled": false
      }
    }
  }
}
get test/business/_mapping
{
  "test": {
    "mappings": {
      "business": {
        "properties": {
          "name": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "freqs"
          }
        }
      }
    }
  }
}
You have to recreate the index for the new mapping to take effect.
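Recreating the index amounts to deleting it and creating it again with the desired mapping supplied up front. Here is a rough sketch using the JDK's built-in HTTP client (Java 11+) against the REST API; the node address, the index name "test", and the class and method names are assumptions for illustration, not part of the original setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecreateIndex {
    // Assumed address of a local node; adjust for your cluster.
    static final String BASE = "http://localhost:9200";

    // Same field definition as in the example, with index_options "docs" from the start.
    static final String MAPPING =
        "{\"mappings\":{\"business\":{\"properties\":{\"name\":{"
      + "\"type\":\"string\",\"index_options\":\"docs\","
      + "\"norms\":{\"enabled\":false}}}}}}";

    // DELETE /{index}: drops the index together with its old mapping and data.
    static HttpRequest deleteIndex(String index) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index)).DELETE().build();
    }

    // PUT /{index} with a body: creates the index with the mapping already in place.
    static HttpRequest createIndex(String index) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(MAPPING))
            .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest[] steps = { deleteIndex("test"), createIndex("test") };
        for (HttpRequest req : steps) {
            try {
                HttpResponse<String> res = http.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.println(req.method() + " " + req.uri() + " -> " + res.statusCode());
            } catch (IOException e) {
                System.out.println(req.method() + " " + req.uri() + " failed: " + e);
            }
        }
        // The documents must be indexed again afterwards; deleting the index removes them.
    }
}
```

Deleting the index discards its documents, so this only makes sense when the data can be indexed again from its original source.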
Your field type must be text.
You have to reindex in Elasticsearch, that is, create a new index:
"mappings": {
"properties": {
"text": {
"type": "text",
"index_options": "docs"
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-options.html