How to use the ThetaSketchOp function in pydruid

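I am using pydruid to query a Druid database, and I want to compute a post-aggregation over the cases where one aggregation is true and the other is false (for example, people who match the first filtered aggregation but not the second).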

I have been able to compute this post-aggregation with curl, by POSTing a JSON query to the Druid database.

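With pydruid I have been able to compute the initial aggregations and the post-aggregation for the intersection of the two aggregation groups. I have tried to find a way to use the ThetaSketchOp class for the "one but not the other" case, but so far without success.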

Here is my attempt so far at using the ThetaSketchOp class in pydruid:

result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange') 
    ),    
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketch('apple') &
                postaggregator.ThetaSketch('orange')                
        ),
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketchOp(
                fn='not', 
                fields=[
                    postaggregator.ThetaSketch('apple'),
                    postaggregator.ThetaSketch('orange')
                ],
                name='testing'
            )
        )
    }
)

Here is the query in JSON format. When it is sent to the Druid database, it produces the desired result:

{
  "queryType": "groupBy",
  "dataSource": "datasource",
  "granularity": "ALL",
  "dimensions": [],
  "aggregations": [
    {
      "type" : "filtered",
      "filter" : {
        "type" : "selector",
        "dimension" : "fruit",
        "value" : "apple"
      },
      "aggregator" : {
        "type": "thetaSketch", "name": "apple", "fieldName": "person"
      }
    },
    {
      "type" : "filtered",
      "filter" : {
        "type" : "selector",
        "dimension" : "fruit",
        "value" : "orange"
      },
      "aggregator" : {
        "type": "thetaSketch", "name": "orange", "fieldName": "person"
      }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_orange",
      "field":
      {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "INTERSECT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    },
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_not_orange",
      "field":
      {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "NOT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    }
  ],
  "intervals": [ "2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z" ]
}
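
For reference, a JSON query like the one above can also be POSTed from Python instead of curl. This is a minimal sketch using only the standard library; the broker URL is an assumption (it is not given in the question), the helper names are hypothetical, and the query dict is a trimmed stand-in for the full JSON.

```python
import json
import urllib.request

# Assumed broker endpoint; replace with the address of your own Druid broker.
DRUID_URL = "http://localhost:8082/druid/v2/"


def build_request(query, url=DRUID_URL):
    """Wrap a Druid native query (a dict) in an HTTP POST request."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=DRUID_URL):
    """Send the query and decode the JSON response (needs a reachable broker)."""
    with urllib.request.urlopen(build_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Trimmed stand-in; paste the full aggregations and postAggregations here.
query = {
    "queryType": "groupBy",
    "dataSource": "datasource",
    "granularity": "all",
    "dimensions": [],
    "aggregations": [],
    "intervals": ["2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z"],
}

# Building the request does not touch the network; run_query(query) would.
request = build_request(query)
```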

Thanks for reading. Let me know if I should provide any further information.

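If you build the NOT theta sketch operation with the != operator between two ThetaSketch objects, instead of constructing ThetaSketchOp directly, it seems to work: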

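from pydruid.client import PyDruid
from pydruid.utils import aggregators, filters, postaggregator

# Broker URL and endpoint are placeholders; adjust them for your cluster.
query = PyDruid('http://localhost:8082', 'druid/v2/')

result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        # One theta sketch of 'person' per fruit.
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            filters.Dimension('fruit') == 'orange',
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        # & builds the INTERSECT set operation.
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        # != builds the NOT set operation (apple but not orange).
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') !=
            postaggregator.ThetaSketch('orange')
        )
    }
)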

(I found this by digging through the pydruid source code.)
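
To check what the != expression should turn into, it can help to build the target post-aggregation by hand and compare it with the working JSON from the question. The sketch below does not use pydruid at all: the helper names are hypothetical, the structure is copied from the question's JSON, and the sets of people are made-up toy data that only illustrate that NOT behaves like an ordered set difference.

```python
import json


def field_access(name):
    """Reference an aggregation by name, as in the question's JSON."""
    return {"type": "fieldAccess", "fieldName": name}


def theta_set_op(func, field_names, name):
    """Build a thetaSketchSetOp (func is e.g. 'INTERSECT' or 'NOT')."""
    return {
        "type": "thetaSketchSetOp",
        "name": name,
        "func": func,
        "fields": [field_access(n) for n in field_names],
    }


def theta_estimate(name, field):
    """Wrap a sketch expression in a thetaSketchEstimate post-aggregation."""
    return {"type": "thetaSketchEstimate", "name": name, "field": field}


# The order of the fields matters for NOT: apple minus orange.
post_agg = theta_estimate(
    "apple_&_not_orange",
    theta_set_op("NOT", ["apple", "orange"], "final_unique_users_sketch"),
)
print(json.dumps(post_agg, indent=2))

# Toy data (not from the question) showing the intended set semantics.
apple = {"ann", "bob", "cy"}
orange = {"bob", "dee"}
both = apple & orange          # INTERSECT
apple_only = apple - orange    # NOT
```

Here `both` contains one person and `apple_only` contains two, which is the exact-set analogue of the two estimates the query returns.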