Apache Pig: Most frequent value in a bag

My data looks like this:

SWE "{(Figure Skating),(Tennis),(Tennis)}"
GER "{(Figure Skating),(Figure Skating)}"

I want to produce this:

SWE Tennis
GER "Figure Skating"

Relation alias: x
Alias of field #1: NOC
Alias of field #2: sports

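The obvious idea is to generate counts and filter by the maximum count, but I don't even know how to iterate over the sports field. How can this be done?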

I would recommend using the DataFu CountEach UDF to count the instances of each sport in the bag. You can then find the highest count in each bag. One way to do this is to order each 'sports' bag by count and then take the first tuple from it, using the FirstTupleFromBag UDF.

I've used CountEach in flatten mode because it means the sport name is not 'nested' in the result, but you can define the UDF without 'flatten' if you prefer.

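-- Assumes the DataFu jar has already been REGISTERed in this script.
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
DEFINE FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

-- Replace each bag of sports with a bag of (sport_name, sport_count) tuples.
sports_counted = FOREACH x GENERATE
    NOC,
    CountEachFlatten(sports) AS sports:{(sport_name, sport_count)};

-- Sort each bag by count, highest first, and keep only its first tuple.
max_sports = FOREACH sports_counted {
    ordered_sports = ORDER sports BY sport_count DESC;
    GENERATE
        NOC,
        FirstTupleFromBag(ordered_sports, null);
}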
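
To sanity-check what the script should return for the sample rows, the same count-then-take-the-highest logic can be sketched outside Pig. The snippet below is only an illustration in plain Python, not part of the Pig solution; the function name most_frequent and the rows list are made up for this example.

```python
from collections import Counter

def most_frequent(bag):
    """Return (value, count) for the most common value in bag, or None if bag is empty."""
    counts = Counter(bag)                # plays the role of CountEach
    if not counts:
        return None                      # mirrors the null default given to FirstTupleFromBag
    return counts.most_common(1)[0]      # plays the role of ORDER ... DESC plus first tuple

# Sample rows from the question: (NOC, sports)
rows = [
    ("SWE", ["Figure Skating", "Tennis", "Tennis"]),
    ("GER", ["Figure Skating", "Figure Skating"]),
]

result = {noc: most_frequent(sports) for noc, sports in rows}
print(result)
```

When two sports are tied for the highest count, this sketch keeps whichever was seen first. The Pig script makes no such promise, because ORDER BY sport_count does not specify how ties are broken.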