Finding the top n occurrences per group in Hive
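I have a table in which every record has two columns: title and category.
I want to find the 2 titles that occur most often within each category. Some titles are listed in both categories. How can I achieve this in Hive?
Here is the table creation query: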
create table book(category String, title String) row format delimited fields terminated by '\t' stored as textfile;
And the sample data:
fiction book1
fiction book2
fiction book3
fiction book4
fiction book5
fiction book6
fiction book7
fiction book8
fiction book8
fiction book8
psychology book1
psychology book2
psychology book2
psychology book2
psychology book2
psychology book7
psychology book7
psychology book7
Expected result:
fiction book8
fiction any other
psychology book2
psychology book7
So far I have managed to write this query:
SELECT *
FROM (
    SELECT category, title, count(*) AS sale_count
    FROM book
    GROUP BY category, title
) a
ORDER BY category, sale_count DESC;
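This gives the count of each title within each category, but I cannot find a way to return only the top 2 records per category.
To return only the top two records per category, use row_number():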
SELECT category, title, sale_count
FROM (
    SELECT a.*,
           row_number() OVER (PARTITION BY category ORDER BY sale_count DESC) rn
    FROM (
        SELECT category, title, count(*) AS sale_count
        FROM book
        GROUP BY category, title
    ) a
) s
WHERE rn <= 2
ORDER BY category, sale_count DESC;
If several titles tie on sale_count and you need every row that belongs to the two highest counts, use DENSE_RANK instead of row_number. It assigns the same rank to titles that have the same sale_count.
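For illustration, the DENSE_RANK variant might look like the sketch below. It is the row_number() query with only the ranking function swapped; the alias rnk is an arbitrary name chosen here, and the table is assumed to be the book(category, title) table from the question.

```sql
-- Sketch: keep every title whose count is among the two highest counts
-- in its category. Tied titles share a rank, so more than two rows per
-- category can come back.
SELECT category, title, sale_count
FROM (
    SELECT a.*,
           dense_rank() OVER (PARTITION BY category ORDER BY sale_count DESC) AS rnk
    FROM (
        -- occurrences of each title within each category
        SELECT category, title, count(*) AS sale_count
        FROM book
        GROUP BY category, title
    ) a
) s
WHERE rnk <= 2
ORDER BY category, sale_count DESC;
```

With the sample data above, psychology would still return book2 and book7, but fiction would return book8 plus all seven titles tied at a count of 1, since they all share rank 2. If exactly two rows per category are required, row_number() is the better fit.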