How to filter a table based on a percentile and then random sample in HQL?
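I'm trying to randomly sample 200 rows from a table, but first I want to filter it so that only the top 1% of values of a variable are kept.
I'm getting the following error: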
Error while compiling statement: FAILED: ParseException line 3:31
cannot recognize input near 'select' 'percentile_approx' '(' in
expression specification
Below is my query:
> with sample_pop as (select * from
> mytable a where
> a.transaction_amount > (select
> percentile_approx(transaction_amount, 0.99) as top1
> from mytable) )
>
> select * from sample_pop distribute by rand(1) sort by rand(1) limit
> 200;
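I don't think Hive supports scalar subqueries the way you are using them (subqueries in the WHERE clause only work with IN/EXISTS). So move the logic to the FROM clause: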
with sample_pop as (
    select *
    from mytable a cross join
         (select percentile_approx(transaction_amount, 0.99) as top1
          from mytable
         ) aa
    where a.transaction_amount > aa.top1
)
select *
from sample_pop
distribute by rand(1)
sort by rand(1)
limit 200;
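As a sketch of the same idea, the threshold can also live in its own CTE, with only the columns of mytable selected so that the helper column top1 does not end up in the sample. The CTE name threshold and the alias t are assumptions for illustration, and this has not been run against a real cluster.

```sql
-- Compute the 99th-percentile cutoff once, in its own CTE
-- ("threshold" is a hypothetical name, not from the original query).
with threshold as (
    select percentile_approx(transaction_amount, 0.99) as top1
    from mytable
),
-- Keep only rows above the cutoff; a.* drops the helper column top1.
sample_pop as (
    select a.*
    from mytable a cross join threshold t
    where a.transaction_amount > t.top1
)
-- Shuffle rows across reducers, sort randomly within each, then take 200.
select *
from sample_pop
distribute by rand(1)
sort by rand(1)
limit 200;
```

Note that percentile_approx is an approximation, so the number of rows passing the filter may differ slightly from exactly 1% of the table.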
The following query solved my problem:
with sample_pop as (
    select a.*
    from (
        select *, cum_dist() over (order by transaction_amount asc) pct
        from mytable
    ) a
    where pct >= 0.99
)
select *
from sample_pop
distribute by rand(1)
sort by rand(1)
limit 200;