Calculate the sum/avg of an attribute in Apache Pig

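How do I calculate the average or sum of an attribute in Apache Pig vertically (down a column, across rows) rather than horizontally (across fields in one row)? There are plenty of examples for the horizontal case but not for the vertical one.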

Here is my code:

f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING 
 PigStorage(',') AS (Year:int,  ArrDelay:chararray);

f123 =  f1;
ff123 = FILTER f123 BY something;
grp = GROUP ff123 ALL;
cnt = FOREACH grp GENERATE COUNT(ff123);-- this counts the number of rows and works fine
DUMP cnt;

-- The below code is the problem
DESCRIBE grp;
cntsum = FOREACH grp GENERATE FLATTEN(ff123.ArrDelay);
DESCRIBE cntsum;

And the output:

(2008,30)
(2009,60)
(2)
 grp: {group: chararray,ff123: {(Year: int,ArrDelay: chararray)}}
cntsum: {null::ArrDelay: chararray}

But this gives me an error:

cntsum = FOREACH grp GENERATE SUM((int)FLATTEN(ff123.ArrDelay));
DESCRIBE cntsum;

I need to get 90 as the output (30+60).

By the way, what does this schema in the output mean:

 cntsum: {null::ArrDelay: chararray}

I am using Apache Pig version 0.16.0.2.6.5.0-292.

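You should load ArrDelay as an int column. SUM takes a bag of numeric values, so pass it ff123.ArrDelay directly; FLATTEN cannot be used as an argument to SUM.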

f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int,  ArrDelay:int);
ff123 = FILTER f1 BY something;
grp = GROUP ff123 ALL;
total = FOREACH grp GENERATE SUM(ff123.ArrDelay);
DUMP total;

If that is not an option, load ArrDelay as a chararray and cast it to int before the GROUP ALL and the SUM:

f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int,  ArrDelay:chararray);
ff123 = FILTER f1 BY something;
f2 = FOREACH ff123 GENERATE Year, (int)ArrDelay as ArrDelay;
grp = GROUP f2 ALL;
total = FOREACH grp GENERATE SUM(f2.ArrDelay);
DUMP total;
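
The question also asks about the average. A minimal sketch, assuming the same file and schema as above: the built-in AVG function takes a bag in the same way SUM does, so both can be computed in one FOREACH. The aliases stats, total and average are names chosen here for illustration, and the FILTER step is left out because its condition is not given in the question.

```pig
-- Load ArrDelay as chararray, as in the question, then cast it to int.
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
f2 = FOREACH f1 GENERATE Year, (int)ArrDelay AS ArrDelay;

-- GROUP ... ALL puts every row into a single bag so the aggregates run over the whole column.
grp = GROUP f2 ALL;

-- SUM and AVG both take the bag projection f2.ArrDelay; no FLATTEN is needed.
stats = FOREACH grp GENERATE SUM(f2.ArrDelay) AS total, AVG(f2.ArrDelay) AS average;
DUMP stats;
```

With the two sample rows from the question (30 and 60), this should give a sum of 90 and an average of 45.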