How to find the number of unique connections using Hive/Pig
I have a sample table like the following:
caller receiver
100 200
100 300
400 100
100 200
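I need to find the number of unique connections for each number, regardless of whether the number appears as caller or receiver.
For example, 100 is connected to 200, 300 and 400.
My output should look like this: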
100 3
200 1
300 1
400 1
I am trying to do this with Hive. If it cannot be done in Hive, can it be done in Pig?
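Here is a way to do what you need (I'm not fully convinced it is optimal, so I'll leave the optimizing to you). You will need the Brickhouse jar, which is very easy to build.
Query: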
add jar ./brickhouse-0.7.1.jar; -- name and path of yours will be different
create temporary function combine_unique as 'brickhouse.udf.collect.CombineUniqueUDAF';

select connection
     , size(combine_unique(arr)) c
from (
  select connection, arr
  from (
    select caller as connection
         , collect_set(receiver) arr
    from some_table
    group by caller ) x
  union all
  select connection, arr
  from (
    select receiver as connection
         , collect_set(caller) arr
    from some_table
    group by receiver ) y ) f
group by connection;
Output:
connection c
100 3
200 1
300 1
400 1
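To make the logic of that query easier to follow, here is a minimal Python sketch of the same idea: collect the distinct receivers per caller, collect the distinct callers per receiver, then merge the two sets for each number and take the size. It assumes that combine_unique merges arrays and keeps only distinct elements, which is what the output above suggests. The function name unique_connection_counts is made up for illustration and is not part of Hive or Brickhouse.

```python
from collections import defaultdict


def unique_connection_counts(rows):
    """Count distinct partners per number, ignoring call direction.

    Hypothetical helper that mirrors the shape of the Hive query above.
    """
    outgoing = defaultdict(set)  # plays the role of collect_set(receiver) grouped by caller
    incoming = defaultdict(set)  # plays the role of collect_set(caller) grouped by receiver
    for caller, receiver in rows:
        outgoing[caller].add(receiver)
        incoming[receiver].add(caller)

    # Merge both directions per number (assumed behaviour of combine_unique),
    # then take the size of the merged set.
    numbers = sorted(set(outgoing) | set(incoming))
    return {
        n: len(outgoing.get(n, set()) | incoming.get(n, set()))
        for n in numbers
    }


# Sample rows from the question: (caller, receiver)
sample = [(100, 200), (100, 300), (400, 100), (100, 200)]
counts = unique_connection_counts(sample)
print(counts)
```

On the sample rows this gives 3 for 100 and 1 for each of 200, 300 and 400, matching the output table above.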
This will solve your problem:
select q1.caller, count(distinct(q1.receiver))
from (
  select caller, receiver
  from test_1
  group by caller, receiver
  union all
  select receiver as caller, caller as receiver
  from test_1
  group by receiver, caller
) q1
group by q1.caller;
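If you want to check this UNION ALL approach without a cluster, one option is to run a query of the same shape against an in-memory SQLite database from Python. This is only a sketch under assumptions: SQLite is not Hive, so it checks the logic rather than Hive's behaviour; the inner GROUP BY clauses are left out because COUNT(DISTINCT ...) already removes duplicates; and the helper name unique_connections is made up for illustration.

```python
import sqlite3

# Same shape as the Hive query above: add each pair in both directions,
# then count distinct partners per number.
QUERY = """
SELECT q1.caller, COUNT(DISTINCT q1.receiver) AS c
FROM (
    SELECT caller, receiver FROM test_1
    UNION ALL
    SELECT receiver AS caller, caller AS receiver FROM test_1
) q1
GROUP BY q1.caller
ORDER BY q1.caller
"""


def unique_connections(rows):
    """Load (caller, receiver) rows into SQLite and run the query (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test_1 (caller INTEGER, receiver INTEGER)")
        conn.executemany("INSERT INTO test_1 VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Sample rows from the question
sample = [(100, 200), (100, 300), (400, 100), (100, 200)]
result = unique_connections(sample)
print(result)  # [(100, 3), (200, 1), (300, 1), (400, 1)]
```

Unlike the first answer, this needs no extra jar, since everything is done with UNION ALL and COUNT(DISTINCT).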