Pig: Cast error while grouping data

This is the code I am trying to run. Steps:

  1. Load the input (the input folder contains a .pig_schema file)
  2. Take only two fields (both chararray) from it and remove duplicates
  3. Group on one of those fields

The code is as follows:

x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}

distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}

grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;

When I do the grouping, I get the following error:

ERROR org.apache.pig.tools.pigstats.SimplePigStats  - 
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String

keywords is a chararray, and Pig should be able to group on a chararray. Any ideas?

EDIT: Input file:

0000010000014743       call for midwife    23      1425761139
0000010000062069       naruto 1    56      1425780386
0000010000079919       the following    98     1425788874
0000010000081650       planes 2    76      1425721945
0000010000118785       law and order    21     1425763899
0000010000136965       family guy    12    1425766338
0000010000136100       american dad    19      1425766702

The .pig_schema file:

{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}

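Pig is not recognizing the value of keywords as a chararray. It is better to name the fields during the initial load, so that the field types are stated explicitly.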

x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
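
The type codes in the .pig_schema file are the integer constants from Pig's DataType class: 55 is chararray and 15 is long. As a rough sketch, written in Python and run outside Pig, the schema file from the question can be turned into the AS clause used above. The helper name schema_to_as_clause is made up for illustration, and the lookup table covers only the simple types.

```python
import json

# Numeric type codes from org.apache.pig.data.DataType (simple types only).
PIG_TYPE_NAMES = {
    5: "boolean",
    10: "int",
    15: "long",
    20: "float",
    25: "double",
    30: "datetime",
    50: "bytearray",
    55: "chararray",
}


def schema_to_as_clause(schema_json: str) -> str:
    """Build a Pig AS clause from the text of a .pig_schema file.

    Hypothetical helper for illustration; it is not part of Pig.
    """
    fields = json.loads(schema_json)["fields"]
    parts = []
    for field in fields:
        type_name = PIG_TYPE_NAMES.get(field["type"])
        if type_name is None:
            # Codes missing from the table (e.g. map, tuple, bag) are reported, not guessed.
            raise ValueError(
                f"unhandled type code {field['type']} for field {field['name']}"
            )
        parts.append(f"{field['name']}:{type_name}")
    return "AS (" + ", ".join(parts) + ")"


# Schema text copied from the question.
schema_text = (
    '{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},'
    '{"name":"score","type":55},{"name":"time","type":15}]}'
)
print(schema_to_as_clause(schema_text))
# prints: AS (id:chararray, keywords:chararray, score:chararray, time:long)
```

Generating the clause from the schema file keeps the explicit types in the LOAD statement in step with what the file declares.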

Update:

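I tried the snippet below with the .pig_schema updated to include score, using '\t' as the separator, and ran the following steps against the input you shared.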

x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;

I would suggest using unique alias names for better readability and maintainability.

Output:

    (naruto 1,{(naruto 1,0000010000062069)})
    (planes 2,{(planes 2,0000010000081650)})
    (family guy,{(family guy,0000010000136965)})
    (american dad,{(american dad,0000010000136100)})
    (law and order,{(law and order,0000010000118785)})
    (the following,{(the following,0000010000079919)})
    (call for midwife,{(call for midwife,0000010000014743)})
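
As a sanity check on the result, here is a minimal Python sketch of the same DISTINCT and GROUP steps over the sample rows from the question. It only mimics the semantics and is not how Pig executes the script; the row order of a Pig DUMP is not guaranteed, so only the group contents are comparable.

```python
from collections import defaultdict

# (id, keywords) pairs taken from the sample input in the question.
rows = [
    ("0000010000014743", "call for midwife"),
    ("0000010000062069", "naruto 1"),
    ("0000010000079919", "the following"),
    ("0000010000081650", "planes 2"),
    ("0000010000118785", "law and order"),
    ("0000010000136965", "family guy"),
    ("0000010000136100", "american dad"),
]

# FOREACH x GENERATE keywords, id; followed by DISTINCT.
distinct_pairs = sorted({(keywords, row_id) for row_id, keywords in rows})

# GROUP distinctCounts BY keywords: each key maps to a bag of (keywords, id) tuples.
grouped = defaultdict(list)
for keywords, row_id in distinct_pairs:
    grouped[keywords].append((keywords, row_id))

for key, bag in grouped.items():
    print(key, bag)
```

With this input every keyword occurs once, so each group holds exactly one tuple, which matches the DUMP output above.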