How to use the ThetaSketchOp function in pydruid
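I'm using pydruid to query a Druid database, and I want to calculate a post-aggregation result where one aggregation is true and the other is false.
I've been able to calculate this post-aggregation with curl, by POSTing a JSON query to the Druid database.
With pydruid I've been able to calculate the initial aggregations and the post-aggregation for the intersection of the two aggregation groups. I've tried to find a way to use the ThetaSketchOp class for my purpose, but so far without success.
Here is my attempt so far at using the ThetaSketchOp class in pydruid: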
    result = query.groupby(
        datasource='datasource',
        granularity='all',
        intervals='2018-06-30/2018-08-30',
        filter=(
            (filters.Dimension('fruit') == 'apple') |
            (filters.Dimension('fruit') == 'orange')
        ),
        aggregations={
            'apple': aggregators.filtered(
                filters.Dimension('fruit') == 'apple',
                aggregators.thetasketch('person')),
            'orange': aggregators.filtered(
                (filters.Dimension('fruit') == 'orange'),
                aggregators.thetasketch('person')),
        },
        post_aggregations={
            'apple_&_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketch('apple') &
                postaggregator.ThetaSketch('orange')
            ),
            'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketchOp(
                    fn='not',
                    fields=[
                        postaggregator.ThetaSketch('apple'),
                        postaggregator.ThetaSketch('orange')
                    ],
                    name='testing'
                )
            )
        }
    )
Here is the query in JSON format, which produces the desired result when sent to the Druid database:
    {
      "queryType": "groupBy",
      "dataSource": "datasource",
      "granularity": "ALL",
      "dimensions": [],
      "aggregations": [
        {
          "type": "filtered",
          "filter": {
            "type": "selector",
            "dimension": "fruit",
            "value": "apple"
          },
          "aggregator": {
            "type": "thetaSketch", "name": "apple", "fieldName": "person"
          }
        },
        {
          "type": "filtered",
          "filter": {
            "type": "selector",
            "dimension": "fruit",
            "value": "orange"
          },
          "aggregator": {
            "type": "thetaSketch", "name": "orange", "fieldName": "person"
          }
        }
      ],
      "postAggregations": [
        {
          "type": "thetaSketchEstimate",
          "name": "apple_&_orange",
          "field": {
            "type": "thetaSketchSetOp",
            "name": "final_unique_users_sketch",
            "func": "INTERSECT",
            "fields": [
              {
                "type": "fieldAccess",
                "fieldName": "apple"
              },
              {
                "type": "fieldAccess",
                "fieldName": "orange"
              }
            ]
          }
        },
        {
          "type": "thetaSketchEstimate",
          "name": "apple_&_not_orange",
          "field": {
            "type": "thetaSketchSetOp",
            "name": "final_unique_users_sketch",
            "func": "NOT",
            "fields": [
              {
                "type": "fieldAccess",
                "fieldName": "apple"
              },
              {
                "type": "fieldAccess",
                "fieldName": "orange"
              }
            ]
          }
        }
      ],
      "intervals": [ "2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z" ]
    }
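For reference, the two post-aggregations in that JSON can also be assembled as plain Python dictionaries, which makes the target structure easier to compare against whatever pydruid generates. This is only a sketch: the helper names `field_access`, `theta_set_op`, and `theta_estimate` are made up for illustration and are not part of pydruid or Druid. As far as I understand it, the order of `fields` matters for `NOT`, since the result is the first sketch minus the others.

```python
import json


def field_access(field_name):
    """Reference the output of an aggregator by its name (hypothetical helper)."""
    return {"type": "fieldAccess", "fieldName": field_name}


def theta_set_op(func, field_names, name):
    """Build a thetaSketchSetOp spec; func is UNION, INTERSECT, or NOT."""
    return {
        "type": "thetaSketchSetOp",
        "name": name,
        "func": func,
        "fields": [field_access(f) for f in field_names],
    }


def theta_estimate(name, field):
    """Wrap a sketch expression so the query returns a numeric estimate."""
    return {"type": "thetaSketchEstimate", "name": name, "field": field}


post_aggregations = [
    theta_estimate(
        "apple_&_orange",
        theta_set_op("INTERSECT", ["apple", "orange"], "final_unique_users_sketch"),
    ),
    theta_estimate(
        "apple_&_not_orange",
        theta_set_op("NOT", ["apple", "orange"], "final_unique_users_sketch"),
    ),
]

# Serialize for comparison with the hand-written query above.
print(json.dumps(post_aggregations, indent=2))
```

The printed list should match the `postAggregations` section of the query above, so it can be diffed against the query that pydruid builds.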
Thanks for reading. Let me know if I should provide any other information.
If you use the != operator to create the NOT theta sketch operation, it seems to work:
    from pydruid.utils import aggregators, filters, postaggregator

    # `query` is the pydruid client object from the question.
    result = query.groupby(
        datasource='datasource',
        granularity='all',
        intervals='2018-06-30/2018-08-30',
        filter=(
            (filters.Dimension('fruit') == 'apple') |
            (filters.Dimension('fruit') == 'orange')
        ),
        aggregations={
            'apple': aggregators.filtered(
                filters.Dimension('fruit') == 'apple',
                aggregators.thetasketch('person')),
            'orange': aggregators.filtered(
                (filters.Dimension('fruit') == 'orange'),
                aggregators.thetasketch('person')),
        },
        post_aggregations={
            # `&` builds the INTERSECT set operation.
            'apple_&_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketch('apple') &
                postaggregator.ThetaSketch('orange')
            ),
            # `!=` builds the NOT set operation (apple but not orange).
            'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketch('apple') !=
                postaggregator.ThetaSketch('orange')
            )
        }
    )
(I found this by digging through the pydruid source code.)
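To illustrate why an operator like != can produce a NOT set operation at all, here is a simplified, self-contained sketch of the operator-overloading pattern. It is not pydruid's actual code: the class `SketchExpr` and its methods are hypothetical, and the generated names are invented for the example. It only shows how `&`, `|`, and `!=` can each be mapped to a `thetaSketchSetOp` function.

```python
class SketchExpr:
    """Hypothetical stand-in for a theta sketch expression (not pydruid's class)."""

    def __init__(self, spec, name):
        self.spec = spec  # JSON-like dict describing the expression
        self.name = name

    @classmethod
    def field(cls, field_name):
        """Leaf expression that refers to an aggregator by name."""
        return cls({"type": "fieldAccess", "fieldName": field_name}, field_name)

    def _combine(self, func, other, joiner):
        """Build a set operation with self first and other second."""
        name = f"{self.name}_{joiner}_{other.name}"
        spec = {
            "type": "thetaSketchSetOp",
            "name": name,
            "func": func,
            "fields": [self.spec, other.spec],
        }
        return SketchExpr(spec, name)

    def __and__(self, other):
        return self._combine("INTERSECT", other, "AND")

    def __or__(self, other):
        return self._combine("UNION", other, "OR")

    def __ne__(self, other):
        # Overloading != to return an expression instead of a bool.
        return self._combine("NOT", other, "NOT")


apple_not_orange = SketchExpr.field("apple") != SketchExpr.field("orange")
print(apple_not_orange.spec["func"])  # NOT
```

With this pattern the left operand becomes the first entry in `fields`, so `apple != orange` and `orange != apple` describe different set differences.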