Is it possible to do a 'normalized' dense_rank() in Hive?

Question

I have a consumer table like this.

consumer | product | quantity
-------- | ------- | --------
a        | x       | 3
a        | y       | 4
a        | z       | 1
b        | x       | 3
b        | y       | 5
c        | x       | 4

What I want is a 'normalized' rank assigned to each consumer, so that I can easily split the table into training and test sets. I used dense_rank() in Hive, which gave me the table below.

rank | consumer | product | quantity
---- | -------- | ------- | --------
1    | a        | x       | 3
1    | a        | y       | 4
1    | a        | z       | 1
2    | b        | x       | 3
2    | b        | y       | 5
3    | c        | x       | 4

This is fine, but I want it to scale to any number of consumers, so I would like the rank to fall between 0 and 1, like this.

rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a        | x       | 3
0.33 | a        | y       | 4
0.33 | a        | z       | 1
0.67 | b        | x       | 3
0.67 | b        | y       | 5
1    | c        | x       | 4

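That way I always know the range of the ranks and can split the data in a standard way (rank <= 0.7 for training, rank > 0.7 for testing).

Is there a way to achieve this in Hive?

Or is there a different, better approach to my original problem of splitting the data?

I tried select * where rank < 0.7*max(rank), but Hive says the MAX UDAF is not yet available in the WHERE clause.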

Answer 1

percent_rank

select  percent_rank() over (order by consumer) as pr
       ,* 

from    mytable
;

+-----+----------+---------+----------+
| pr  | consumer | product | quantity |
+-----+----------+---------+----------+
| 0.0 | a        | z       |        1 |
| 0.0 | a        | y       |        4 |
| 0.0 | a        | x       |        3 |
| 0.6 | b        | y       |        5 |
| 0.6 | b        | x       |        3 |
| 1.0 | c        | x       |        4 |
+-----+----------+---------+----------+
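
Note that percent_rank() is computed from row positions as (rank - 1) / (total rows - 1), so consumer b gets (4 - 1) / (6 - 1) = 0.6 rather than the 0.67 shown in the question. If the exact 1/3, 2/3, 1 values matter, one possible sketch, assuming the same mytable, is to divide dense_rank() by its maximum in an outer query. The aliases dr and nrank are illustrative names only, not from the original post.

```sql
-- Sketch: normalize dense_rank() into (0, 1] by dividing by the largest rank.
-- Window functions cannot be nested, so dense_rank() is computed in a subquery first.
select  dr / max(dr) over () as nrank   -- Hive's "/" operator returns a double
       ,consumer
       ,product
       ,quantity

from   (select  dense_rank() over (order by consumer) as dr
               ,consumer
               ,product
               ,quantity

        from    mytable
        ) t
;
```

On the sample data this should give roughly 0.33 for a, 0.67 for b and 1.0 for c.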

To filter on it, you need a subquery/CTE.

select  *

from   (select  percent_rank() over (order by consumer) as pr
               ,* 

        from    mytable
        ) t

where   pr <= ...
;
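
As a concrete illustration with the 0.7 cut-off mentioned in the question, the training set could be selected along these lines. This is only a sketch written as a CTE; the name ranked is made up for the example.

```sql
-- Sketch: training set = rows whose percent rank is at most 0.7.
with ranked as (
    select  percent_rank() over (order by consumer) as pr
           ,consumer
           ,product
           ,quantity

    from    mytable
)
select  *

from    ranked

where   pr <= 0.7   -- use pr > 0.7 instead to get the test set
;
```

Because all rows of a consumer share the same pr value, a consumer never ends up split across the two sets. On the sample data, a (0.0) and b (0.6) fall into training and c (1.0) into test.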

hadoop

hive

machine-learning

training-data