SUM and AVG in Pig are not working

I am analyzing cluster user log files with the following Pig code:

    t_data = load 'log_flies/*' using PigStorage(',');
    A = foreach t_data generate $0 as (jobid:int),
         as (indexid:int),  as (clusterid:int),  as (user:chararray),
         as (stat:chararray),  as (queue:chararray),  as (projectName:chararray),  as (cpu_used:float),  as (efficiency:float),   as (numThreads:int),
         as (numNodes:int),   as (numCPU:int), as (comTime:int),
         as (penTime:int),   as (runTime:int), /(*) as (allEff: float), SUBSTRING(, 0, 11) as (endTime: chararray);
    ---describe A;
    A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
    B = group A by user;
    f_data = foreach B {
           grp = group;
           count = COUNT(A);
          avg = AVG(A.cpu_used);
          generate FLATTEN(grp), count, avg;
       };
    f_data = limit f_data 10;
    dump f_data;

The code works for GROUP and COUNT, but when I include AVG and SUM, it fails with this error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias f_data

I checked the data types and they all look fine. Do you have any suggestions about what I missed? Thanks in advance for your help.

It is a syntax error. See http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.

Pig script:

   A = LOAD 'a.csv' USING  PigStorage(',') AS (user:chararray,    cpu_used:float);
   B = GROUP A BY user;
   C = FOREACH B {
    cpu_used_bag = A.cpu_used;
    GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
    };

Input: a.csv

a,3
a,4
b,5

Output:

(a,3.5,7.0)
(b,5.0,5.0)

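Your Pig script has several issues:

  • Do not use the same alias on both sides of the =;
  • Declare your schema in the load statement with an AS clause, for example using PigStorage(',') as (jobid:int, cpu_used:float);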

    A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
    

    Change this to F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;

    f_data = limit f_data 10; change the f_data on the left-hand side to some other name.

    Don't make your life complicated. General rules for debugging a Pig script:

    • Run in local mode
    • Dump after every line
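
The two debugging rules above can be sketched roughly as follows. This is only an illustration, not the asker's actual pipeline: the aliases `raw` and `by_job` are hypothetical, and the file path and two-column schema are borrowed from the sample script in this answer.

```pig
-- Start the Grunt shell in local mode from the command line:
--   pig -x local

-- Load with an explicit schema (path and schema are assumptions for illustration).
raw = LOAD './file' USING PigStorage(',') AS (jobid:int, cpu_used:float);
DESCRIBE raw;     -- check that the schema is what you expect
DUMP raw;         -- inspect the loaded tuples before going further

-- Add one step at a time and inspect it before adding the next.
by_job = GROUP raw BY jobid;
DESCRIBE by_job;  -- shows the group key and the bag of grouped tuples
DUMP by_job;
```

Inspecting each alias right after it is defined makes it easier to see which statement first produces an error, instead of only seeing the failure at the final dump.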

    I wrote a sample Pig script to mimic yours (it works):

        t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);

        C = foreach t_data generate jobid, cpu_used;
        B = group C by jobid;
        f_data = foreach B {
            count = COUNT(C);
            sum = SUM(C.cpu_used);
            avg = AVG(C.cpu_used);
            generate FLATTEN(group), count, sum, avg;
        };
        never_f_data = limit f_data 10;

        dump never_f_data;