How to get the average of each row in Pig
After running the following script:
REGISTER 's3://jmh-dtg-2016/jeon_dtg/test.py' USING jython as test;
raw01 = LOAD 's3://jmh-dtg-2016/jeon_dtg/test_pig.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
raw02 = FOREACH raw01 GENERATE (chararray) $0 AS date, (chararray) $1 AS code, (chararray) $2 AS car_num, (chararray) $3 AS pre_time, (float) $4 AS vel, (chararray) $5 AS link_id;
raw03 = GROUP raw02 BY (link_id, car_num);
raw04 = FOREACH raw03 GENERATE group, test.my_fun(raw02.vel) AS val;
dump raw04;
The output has a bag of values in each row, and I want the average of each row.
In short, I want results like this:
{(39.0),(45.0)}) -> 42
{(1.0)}) -> 1
This is the Python function I am using:
@outputSchema('num01:float')
def my_fun(data01):
    a = data01
    b = sorted(a)
    c = int((len(b)/100.0) * 10.0)
    d = int((len(b)/100.0) * 90.0)
    e = b[c:d]
    return e
And this version does not work either:
@outputSchema('num01:float')
def my_fun(data01):
    a = data01
    b = sorted(a)
    c = int((len(b)/100.0) * 10.0)
    d = int((len(b)/100.0) * 90.0)
    e = b[c:d]
    return sum(e)
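Please help me.
It sounds like you just need the average of a bag of values? Correct me if I'm wrong. Pig's built-in AVG function should do this, and it will perform better than a Python UDF.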
raw04 = FOREACH raw03 GENERATE group, AVG(raw02.vel) AS val;
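If you do want to keep the 10%/90% trimming from your UDF, here is a minimal sketch of how it might look. It assumes the bag reaches the Jython UDF as a list of one-field tuples, which would explain why sum(e) fails, since the elements are tuples rather than numbers. The names avg_bag and trimmed_avg are hypothetical, and falling back to the untrimmed values when the slice is empty is my own choice, not something from your script.

```python
# Pig injects outputSchema when the script is registered with "USING jython".
# The fallback below only exists so the file can also run in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


def _floats(bag):
    # Assumption: each bag element is a one-field tuple such as (39.0,).
    return sorted(float(t[0]) for t in (bag or []) if t[0] is not None)


def _mean(values):
    if not values:
        return None
    return sum(values) / float(len(values))


@outputSchema('val:double')
def avg_bag(bag):
    # Plain average of the bag, equivalent in intent to AVG(raw02.vel).
    return _mean(_floats(bag))


@outputSchema('val:double')
def trimmed_avg(bag):
    # Drop the lowest and highest 10% of the sorted values, then average.
    values = _floats(bag)
    lo = int(len(values) * 0.10)
    hi = int(len(values) * 0.90)
    # Small bags can trim down to nothing; fall back to all values then.
    return _mean(values[lo:hi] or values)
```

You would call it the same way as before, for example test.trimmed_avg(raw02.vel) in place of test.my_fun(raw02.vel). Note that with only two values the trimming keeps just the smaller one, so for your expected results (42 and 1) the plain average, or simply AVG, is the one that matches.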