Calculate the sum/avg of an attribute in Apache Pig
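How can I calculate the average or sum of an attribute in Apache Pig vertically (down a column) rather than horizontally (across the fields of a row)? There are plenty of examples for the horizontal case, but not for the vertical one.
Here is my code: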
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING
PigStorage(',') AS (Year:int, ArrDelay:chararray);
f123 = f1;
ff123 = FILTER f123 BY something;
grp = GROUP ff123 ALL;
cnt = FOREACH grp GENERATE COUNT(ff123);-- this counts the number of rows and works fine
DUMP cnt;
-- The below code is the problem
DESCRIBE grp;
cntsum = FOREACH grp GENERATE FLATTEN(ff123.ArrDelay);
DESCRIBE cntsum;
And the output:
(2008,30)
(2009,60)
(2)
grp: {group: chararray,ff123: {(Year: int,ArrDelay: chararray)}}
cntsum: {null::ArrDelay: chararray}
But this gives me an error:
cntsum = FOREACH grp GENERATE SUM((int)FLATTEN(ff123.ArrDelay));
DESCRIBE cntsum;
I need to get 90 as the output (30 + 60).
By the way, what does this schema in the output mean?
cntsum: {null::ArrDelay: chararray}
I am using Apache Pig version 0.16.0.2.6.5.0-292.
You should load ArrDelay as an int column:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
ff123 = FILTER f1 BY something;
grp = GROUP ff123 ALL;
total = FOREACH grp GENERATE SUM(ff123.ArrDelay);
DUMP total;
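The question also asks for the average. A minimal sketch of getting both values in one pass follows, assuming ArrDelay loads cleanly as int. The FILTER condition here is only a stand-in for the elided "something", and the names stats, total_delay and avg_delay are illustrative. For the two sample rows in the question (30 and 60), the sum should come out as 90 and the average as 45.

```pig
-- Sketch only: path and schema are copied from the question.
-- The FILTER condition is a stand-in for the elided "something".
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
ff123 = FILTER f1 BY ArrDelay IS NOT NULL;
-- GROUP ... ALL puts every row into a single bag named ff123
grp = GROUP ff123 ALL;
-- SUM and AVG each take the bag of ArrDelay values projected from that bag
stats = FOREACH grp GENERATE SUM(ff123.ArrDelay) AS total_delay, AVG(ff123.ArrDelay) AS avg_delay;
DUMP stats;
```

Note that both functions are applied to the projected bag ff123.ArrDelay directly, with no FLATTEN involved.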
If that is not an option, load ArrDelay as a chararray and cast it to int before the GROUP ALL and SUM:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
ff123 = FILTER f1 BY something;
f2 = FOREACH ff123 GENERATE Year, (int)ArrDelay as ArrDelay;
grp = GROUP f2 ALL;
total = FOREACH grp GENERATE SUM(f2.ArrDelay);
DUMP total;
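As far as I understand it, the original attempt fails because FLATTEN is only valid as a top-level item in GENERATE, not as an argument to a cast or function, and SUM expects a bag rather than a scalar. Casting row by row before grouping avoids both problems. The sketch below extends that idea to also report the average and how many values failed to parse. It assumes that a chararray which cannot be converted to int becomes null, and that SUM, AVG and COUNT skip nulls while COUNT_STAR does not. The aliases typed, stats and the output field names are illustrative, and the FILTER step is left out because its condition is not given.

```pig
-- Sketch only: path and schema are copied from the question.
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
-- Cast each row before grouping; an unparsable value is assumed to become null
typed = FOREACH f1 GENERATE Year, (int)ArrDelay AS ArrDelay;
grp = GROUP typed ALL;
-- parsed_rows counts non-null ArrDelay values, all_rows counts every row
stats = FOREACH grp GENERATE
    SUM(typed.ArrDelay) AS total_delay,
    AVG(typed.ArrDelay) AS avg_delay,
    COUNT(typed.ArrDelay) AS parsed_rows,
    COUNT_STAR(typed) AS all_rows;
DUMP stats;
```

If parsed_rows is smaller than all_rows, some ArrDelay values were not numeric and were left out of the sum and average.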