Count the grouped records in a Pig query
Here is my test data:
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to get something like the following:
John wrong 4
John correct 1
Jack wrong 3
Jack correct 2
My code:
data = LOAD '/Whosebugq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
The output currently looks like this:
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})
How should I count the records in each group?
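The COUNT function gives you the number of elements in a bag, which is exactly what you want. After grouping by name and result, you end up with one bag per combination, holding every record that has that combination, so counting the bag tells you how many times the combination occurs.
So you only need to add one line: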
data = LOAD '/Whosebugq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
dump C;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
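FLATTEN(group) is needed because grouping produces a tuple containing the fields you grouped by. Judging by your desired output, you do not want those fields inside a tuple; without FLATTEN the output would look like ((Jack,wrong),4).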
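To make the difference concrete, here is a minimal sketch that builds both variants side by side. The relation names C_nested and C_flat and the field alias cnt are made up for illustration; the LOAD path and schema are copied from the question.

```pig
-- Same LOAD and GROUP as in the question.
data = LOAD '/Whosebugq4.txt' USING PigStorage(',') AS (
    name:chararray,
    number:chararray,
    result:chararray);
B = GROUP data BY (name, result);

-- Without FLATTEN the grouping key stays a nested tuple, e.g. ((Jack,wrong),4).
C_nested = FOREACH B GENERATE group, COUNT(data) AS cnt;

-- With FLATTEN the key is expanded into top-level fields, e.g. (Jack,wrong,4).
C_flat = FOREACH B GENERATE FLATTEN(group) AS (name, result), COUNT(data) AS cnt;

-- DESCRIBE prints each schema, which makes the nesting visible without running a job.
DESCRIBE C_nested;
DESCRIBE C_flat;
```

One thing to keep in mind: COUNT skips tuples whose first field is null. If name could be missing in your data and you still want those rows counted, COUNT_STAR(data) is the variant that counts them as well.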