ElasticSearch query on tags

Question

I'm trying to crack the Elasticsearch query language, and so far I'm not doing very well.

My documents have the following mapping.

{
    "mappings": {
        "jsondoc": {
            "properties": {
                "header" : {
                    "type" : "nested",
                    "properties" : {
                        "plainText" : { "type" : "string" },
                        "title" : { "type" : "string" },
                        "year" : { "type" : "string" },
                        "pages" : { "type" : "string" }
                    }
                },
                "sentences": {
                    "type": "nested",
                    "properties": {
                        "id": { "type": "integer" },
                        "text": { "type": "string" },
                        "tokens": { "type": "nested" },
                        "rhetoricalClass": { "type": "string" },
                        "babelSynsetsOcc": {
                            "type": "nested",
                            "properties" : {
                                "id" : { "type" : "integer" },
                                "text" : { "type" : "string" },
                                "synsetID" : { "type" : "string" }
                            }
                        }
                    }
                }
            }
        }
    }
}

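It is basically a JSON file describing a PDF document.

I've been querying with aggregations, and that has gone well so far. I've got to the point of grouping (aggregating) by rhetoricalClass and getting the total number of occurrences of babelSynsetsOcc.synsetID. Heck, I even have the same query with the whole result grouped by header.year.

Right now, though, I'm struggling to filter the documents that contain a term and then run the same query.

So, how can I write a query that groups by rhetoricalClass and considers only those documents whose header.plainText field contains any of ["Computational", "Compositional", "Semantics"]? I mean contains, not equals!

If I were to translate it roughly into SQL, it would be something like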

SELECT count(sentences.babelSynsetsOcc.synsetID)
FROM jsondoc
WHERE header.plainText like '%Computational%' OR header.plainText like '%Compositional%' OR header.plainText like '%Semantics%'
GROUP BY sentences.rhetoricalClass

Answer 1

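WHERE clauses are just standard structured queries, so they translate to queries in Elasticsearch.

GROUP BY and HAVING loosely translate to aggregations in Elasticsearch's DSL. Functions like count, min, max, and sum are functions of the GROUP BY, so they are aggregations as well.

The fact that you are using nested objects may be necessary, but it adds an extra layer to every part that touches them. If those nested objects are not arrays, then do not use nested; use object in that case.

I would probably translate your query to: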

{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "header.plainText" : "Computational"
              }
            },
            {
              "match": {
                "header.plainText" : "Compositional"
              }
            },
            {
              "match": {
                "header.plainText" : "Semantics"
              }
            }
          ]
        }
      }
    }
  }
}
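If the list of terms changes often, the `should` clauses above can be generated rather than written by hand. The following is only a minimal sketch in Python that builds the same request body as a plain dict; the helper name `build_header_filter` and its parameters are hypothetical and not part of any Elasticsearch client.

```python
def build_header_filter(terms, field="header.plainText", path="header"):
    """Build a nested bool/should query with one match clause per term.

    Hypothetical helper: it only assembles a dict shaped like the
    query shown above, it does not talk to Elasticsearch.
    """
    should = [{"match": {field: term}} for term in terms]
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"bool": {"should": should}},
            }
        }
    }


body = build_header_filter(["Computational", "Compositional", "Semantics"])
```

The resulting `body` can be serialized with `json.dumps` and sent as the search request body.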

Alternatively, it can be rewritten like this, which makes the intent a little less obvious:

{
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  }
}
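The single `match` version works because a `match` query analyzes its text into tokens and, by default, combines them with OR. One way to make that intent visible again is to spell the operator out using the expanded form of `match`. A minimal sketch in Python, where the helper name `build_match_any` is hypothetical:

```python
def build_match_any(terms, field="header.plainText", path="header"):
    """Build a nested match query that states the OR operator explicitly.

    Hypothetical helper: "or" is already the default operator of a
    match query, so this only documents the intent in the request.
    """
    match = {field: {"query": " ".join(terms), "operator": "or"}}
    return {"query": {"nested": {"path": path, "query": {"match": match}}}}


explicit_body = build_match_any(["Computational", "Compositional", "Semantics"])
```

Changing `"operator"` to `"and"` would instead require every term to appear in header.plainText.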

The aggregation would be:

{
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now, if you combine them and throw away the hits (since you are only looking for the aggregated result), then it looks like this:

{
  "size": 0,
  "query": {
    "nested": {
      "path": "header",
      "query": {
        "match": {
          "header.plainText": "Computational Compositional Semantics"
        }
      }
    }
  },
  "aggs": {
    "nested_sentences": {
      "nested": {
        "path": "sentences"
      },
      "aggs": {
        "group_by_rhetorical_class": {
          "terms": {
            "field": "sentences.rhetoricalClass",
            "size": 10
          },
          "aggs": {
            "nested_babel": {
              "nested": {
                "path": "sentences.babelSynsetsOcc"
              },
              "aggs": {
                "count_synset_id": {
                  "value_count": {
                    "field": "sentences.babelSynsetsOcc.synsetID"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
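Reading the result is mostly a matter of walking the aggregation names back down. The sketch below assumes the response nests the aggregations as nested_sentences > group_by_rhetorical_class > nested_babel > count_synset_id, with the count exposed under `value`. The helper name `synset_counts_by_class` is hypothetical, and `sample_response` is an invented placeholder with made-up keys and numbers, not real output.

```python
def synset_counts_by_class(response):
    """Map each rhetoricalClass bucket key to its synsetID count.

    Hypothetical helper: assumes the aggregation names used in the
    request above and a terms aggregation that returns "buckets".
    """
    buckets = (
        response["aggregations"]["nested_sentences"]
        ["group_by_rhetorical_class"]["buckets"]
    )
    return {
        bucket["key"]: bucket["nested_babel"]["count_synset_id"]["value"]
        for bucket in buckets
    }


# Invented placeholder response, only to illustrate the expected shape.
sample_response = {
    "aggregations": {
        "nested_sentences": {
            "doc_count": 5,
            "group_by_rhetorical_class": {
                "buckets": [
                    {
                        "key": "class_a",
                        "doc_count": 3,
                        "nested_babel": {
                            "doc_count": 7,
                            "count_synset_id": {"value": 7},
                        },
                    },
                    {
                        "key": "class_b",
                        "doc_count": 2,
                        "nested_babel": {
                            "doc_count": 4,
                            "count_synset_id": {"value": 4},
                        },
                    },
                ]
            },
        }
    }
}

counts = synset_counts_by_class(sample_response)
```

With a real client, `response` would be the parsed JSON body returned by the search request.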
