How to group by multiple columns in a Pig script
What would be the Pig equivalent of the following SQL query?
SELECT fld1, fld2, fld3, SUM(fld4)
FROM Table1
GROUP BY fld1, fld2, fld3;
For Table1:
A B C 2 X Y Z
A B C 3 X Y Z
A B D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
A C D 2 X Y Z
Output:
A B C 5
A B D 2
A C D 6
Ref: https://pig.apache.org/docs/r0.11.1/basic.html#GROUP, where you can find a multi-group example.
For your use case, the code below should suffice (it assumes the data is stored comma-delimited in input.csv):
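-- Load the comma-delimited input; only fld1..fld4 are used below.
A = LOAD 'input.csv' USING PigStorage(',') AS (fld1:chararray, fld2:chararray, fld3:chararray, fld4:long, fld5:chararray, fld6:chararray, fld7:chararray);
-- Group on the tuple (fld1, fld2, fld3), flatten the group key back into columns, and sum fld4 per group.
B = FOREACH (GROUP A BY (fld1, fld2, fld3)) GENERATE FLATTEN(group) AS (fld1, fld2, fld3), SUM(A.fld4) AS fld4_aggr;
DUMP B;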
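If the nested one-liner above is hard to read, the same idea can be written in two steps. This is only a sketch under the same assumptions as the answer (a comma-delimited input.csv with seven columns). The alias names G and B2 are made up for illustration. Grouping by several columns yields a relation with two fields: `group`, a tuple holding the key columns, and a bag named after the grouped relation (here `A`) holding the matching rows.

```pig
A = LOAD 'input.csv' USING PigStorage(',') AS (fld1:chararray, fld2:chararray, fld3:chararray, fld4:long, fld5:chararray, fld6:chararray, fld7:chararray);

-- Step 1: group on a tuple of keys. G has two fields: group (a tuple) and A (a bag of rows).
G = GROUP A BY (fld1, fld2, fld3);
DESCRIBE G;

-- Step 2: project each key out of the group tuple and aggregate fld4 over the bag.
B2 = FOREACH G GENERATE group.fld1 AS fld1, group.fld2 AS fld2, group.fld3 AS fld3, SUM(A.fld4) AS fld4_aggr;
DUMP B2;
```

Projecting `group.fld1`, `group.fld2`, `group.fld3` individually is equivalent to `FLATTEN(group)` here. The `DESCRIBE` step is optional and only prints the schema of the grouped relation.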