Finding top 10 occurrences in data

Question

I am trying to find the top 10 mentions (@xxxxx) in my Twitter data. I have created the initial table twitter.full_text_ts and loaded my data into it.

create table twitter.full_text_ts as
select id, cast(concat(substr(ts,1,10), ' ', substr(ts,12,8)) as timestamp) as  ts, lat, lon, tweet
from full_text;

I was able to extract the mentions (patterns) from the tweets using this query:

select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts
order by patterns desc
limit 50;

Executing this gives me:

USER_a3ed4b5a   2010-03-07 03:46:23 fffed220
USER_dc8cfa6f   2010-03-05 18:28:39 fffdabf9
USER_dc8cfa6f   2010-03-05 18:32:55 fffdabf9
USER_915e3f8c   2010-03-07 03:39:09 fffdabf9
and so on...

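As you can see, fffed220 and so on are the extracted patterns.

What I want to do now is count the number of times each mention (pattern) occurs and output the top 10. For example, fffdabf9 occurs 20 times, fffxxxx occurs 17 times, and so on.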

Answer 1

-- The CTE has no order by / limit, so every tweet is counted, not just 50 rows.
with mentions as 
(select id, ts, 
 regexp_extract(lower(tweet), '(.*)@user_(\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts)
select patterns, count(*) as n_mentions
from mentions
group by patterns
order by n_mentions desc
limit 10;
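
One thing to watch for with this extract-then-group approach: regexp_extract typically yields an empty string (or possibly NULL, depending on the engine and version) for tweets that contain no mention, so that empty value can end up at the top of the list. Below is a minimal sketch that filters those rows out before counting. It reuses the table and regex from the question; the alias n_mentions is only illustrative.

```sql
-- Sketch only: count mentions, skipping tweets where the regex found nothing.
with mentions as (
  select regexp_extract(lower(tweet), '(.*)@user_(\S{8})([:| ])(.*)', 2) as patterns
  from twitter.full_text_ts
)
select patterns, count(*) as n_mentions
from mentions
-- Cover both possible "no match" results (empty string or NULL).
where patterns is not null and patterns <> ''
group by patterns
order by n_mentions desc
limit 10;
```

Because the leading (.*) is greedy, this regex captures at most one mention per tweet (the last one that fits the pattern), so the counts are per tweet rather than per mention. If the pattern matches nothing at all, it may be worth checking whether the backslash in the string literal needs doubling (\\S) in your Hive version.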

Answer 2

The most readable approach is to save your first query into a temporary table and then group by on that temporary table:

create table tmp as
--your query

select patterns, count(*) n_mentions
from tmp
group by patterns
order by count(*) desc
limit 10;
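
Spelled out in full, the two steps might look like the sketch below. It assumes the extraction query from the question with its order by and limit 50 dropped, so that all tweets are counted rather than a 50-row sample. The table name tmp_mentions is hypothetical.

```sql
-- Step 1: materialize the extracted mentions for every tweet.
-- tmp_mentions is a hypothetical table name chosen for this sketch.
create table tmp_mentions as
select id, ts,
       regexp_extract(lower(tweet), '(.*)@user_(\S{8})([:| ])(.*)', 2) as patterns
from twitter.full_text_ts;

-- Step 2: count each pattern and keep the ten most frequent.
select patterns, count(*) as n_mentions
from tmp_mentions
group by patterns
order by n_mentions desc
limit 10;

-- Clean up once the result has been read.
drop table tmp_mentions;
```

Keeping the intermediate table around can be handy while you are still tuning the regex, since you can inspect the extracted patterns without rerunning the extraction each time.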

sql

hadoop

hiveql