How to find the number of unique connections using Hive/Pig

I have a sample table like this:

caller   receiver 
100         200
100         300
400         100
100         200

I need to find the number of unique connections for each number, counting both directions. For example, 100 is connected to 200, 300, and 400.

My output should look like this:

100      3  
200      1  
300      1  
400      1

I am trying to do this with Hive. If it cannot be done in Hive, can it be done in Pig?

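Here is a way to do what you need (I'm not entirely convinced it is optimal, but I will leave the optimizing to you). You will need the Brickhouse jar, which is very simple to build.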

Query:

add jar ./brickhouse-0.7.1.jar; -- name and path of yours will be different
create temporary function combine_unique as 'brickhouse.udf.collect.CombineUniqueUDAF';

select connection
  , size(combine_unique(arr)) c
from (
  select connection, arr
  from (
    select caller as connection
      , collect_set(receiver) arr
    from some_table
    group by caller ) x
  union all
  select connection, arr
  from (
    select receiver as connection
      , collect_set(caller) arr
    from some_table
    group by receiver ) y ) f
group by connection

Output:

connection    c
100           3
200           1
300           1
400           1

This will solve your problem:

select q1.caller, count(distinct q1.receiver)
from (
  select caller, receiver from test_1 group by caller, receiver
  union all
  select receiver as caller, caller as receiver from test_1 group by receiver, caller
) q1
group by q1.caller;
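
As for the Pig half of the question, the same idea (list every pair in both directions, drop duplicates, then count per number) can be sketched in Pig Latin. This is only a minimal sketch under assumptions: the input path 'calls.txt', the tab delimiter, and the relation and field names num, peer, and connections are hypothetical and not taken from the question.

```pig
-- Hypothetical input: a tab-separated file with one caller/receiver pair per line.
calls = LOAD 'calls.txt' USING PigStorage('\t') AS (caller:int, receiver:int);

-- Emit each pair in both directions so every number appears in the first field.
fwd = FOREACH calls GENERATE caller AS num, receiver AS peer;
rev = FOREACH calls GENERATE receiver AS num, caller AS peer;
pairs = UNION fwd, rev;

-- Drop repeated pairs (e.g. 100/200 occurs twice in the sample).
uniq = DISTINCT pairs;

-- Count the remaining peers for each number.
grouped = GROUP uniq BY num;
counts = FOREACH grouped GENERATE group AS num, COUNT(uniq) AS connections;

DUMP counts;
```

On the sample table this should produce the same counts as the Hive queries above (3 for 100, and 1 each for 200, 300, and 400), although DUMP does not guarantee the row order.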