Elasticsearch significant terms aggregation doc_count differs from hits when doing a match phrase search for the same term
I am using a significant terms aggregation, which returns the n most significant terms together with their doc_count and bg_count. The query is:
{
  "query": {
    "terms": {"user_id": ["x"]}
  },
  "aggregations": {
    "word_cloud": {
      "significant_terms": {
        "field": "transcript.results.alternatives.words.word.keyword",
        "size": 200
      }
    }
  },
  "size": 0
}
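If I take a term returned by the significant terms aggregation and run a match phrase query for that term, the number of hits I get differs from the doc_count reported by the aggregation.
Match phrase query: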
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "preprocess_data.results.alternatives.transcript": "<term>"
          }
        },
        {
          "match_phrase": {
            "user_id": "x"
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 22
}
The field preprocess_data.results.alternatives.transcript has the following mapping:
{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
I can't explain the difference in document counts between the aggregation and the match phrase search. Please help.
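This happens because the doc_count figures are collected from all shards of the index, and for the significant terms aggregation they may be approximate. Quoting the Elasticsearch documentation: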
The counts of how many documents contain a term provided in results
are based on summing the samples returned from each shard and as such
may be:
- low if certain shards did not provide figures for a given term in their top sample
- high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which
we have chosen to provide fast performance at the cost of some
(typically small) inaccuracies. However, the size and shard size
settings covered in the next section provide tools to help control the
accuracy levels
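As a rough sketch of the tuning that the quoted passage refers to, the aggregation from the question could ask each shard for more candidate terms through the shard_size parameter. The value 1000 below is only an illustrative assumption, not a recommendation: larger values cost more memory and network traffic, and the resulting counts can still be approximate.

```json
{
  "query": {
    "terms": {"user_id": ["x"]}
  },
  "aggregations": {
    "word_cloud": {
      "significant_terms": {
        "field": "transcript.results.alternatives.words.word.keyword",
        "size": 200,
        "shard_size": 1000
      }
    }
  },
  "size": 0
}
```

If the doc_count values move closer to the hit counts as shard_size grows, per-shard sampling is a likely contributor to the gap.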