Pig - MAX is not working after grouping
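I am working with Pig 0.12.1 and Map-R. I am trying to find the max of a field after grouping the relation on another field. Refer to the following Pig script and the relation structure in the comments -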
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error -
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script explanation -
I am flattening group of SomeRelation into c1, c2 and then regrouping
on c1 to generate max of c2 with each c1 group.
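Please advise.
I am not sure whether the group keyword can be used inside flatten. Also, have you considered tokenizing before flattening the group? For example, look at this: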
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
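This is actually a word count, but you can build on it to find the max, depending on what you want to do.
We now know the problem is that MAX cannot handle biginteger.
You should be able to group and get the max like this, and compare the result against an ORDER + LIMIT combination: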
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER r1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
}
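Well, it looks like the problem is that Pig does not allow MAX (or, for that matter, other aggregate functions such as SUM) on biginteger. The data type has to be long for it to work. Refer to the following -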
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, there is hardly any documentation that points this out for data types like biginteger and bigdecimal.
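Since the error message itself asks for an explicit cast, the same fix can probably also be written as a cast in a separate step instead of in the AS clause. This is only a sketch: the alias r1_long is made up for illustration, and it assumes that every c2 value fits in a long and that this Pig version accepts a biginteger-to-long cast.

```pig
-- Sketch only: r1_long is a hypothetical alias, not part of the original script.
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);

-- Explicitly cast c2 to long so that MAX can resolve a matching implementation.
-- Assumption: no c2 value is outside the range of a long.
r1_long = FOREACH r1 GENERATE c1, (long)c2 AS c2;

-- Regroup on c1 and take the max of c2 within each group.
r3 = FOREACH (GROUP r1_long BY c1) GENERATE group AS c1, MAX(r1_long.c2) AS c2;

-- Check the resulting schema before running the job.
DESCRIBE r3;
```

If some values really do need biginteger, the nested ORDER + LIMIT approach shown earlier may be the safer route, since it does not rely on MAX at all.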