Get value for unique record using Pig

Below is the input data set:

col1,col2,col3,col4,col5

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

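Records are unique based on col2, col3 and col4. For each unique combination, I need to take any one value from col1 and populate it as a new field, col6. The expected output is below: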

col1,col2,col3,col4,col5,col6

key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7

Below is my script, which fails with an error.

A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function

This looks like a good use case for a nested FOREACH.

Ref: https://pig.apache.org/docs/r0.14.0/basic.html#foreach

Input:

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

Pig script:

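-- Declare a schema so col1 is a chararray rather than an untyped bytearray.
A = load 'input.csv' using PigStorage(',')  AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH(GROUP A BY (col2,col3,col4)) {
    -- Within each group, sort by col1 descending and keep the top row.
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    -- Emit every row of the group, each paired with the chosen col1 value as col6.
    GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};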

DUMP B;

Output:

(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
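
Since the question's attempt used MAX, here is a minimal sketch of that variant. It assumes the same 'input.csv' and the same declared schema as the script above; the relation names B and C are only illustrative. Pig's built-in MAX accepts a bag of chararray values, so once col1 is typed it should be usable directly, without the ORDER and LIMIT steps.

```pig
-- Assumption: same file and schema as in the answer's script.
A = LOAD 'input.csv' USING PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);

-- One group per distinct (col2, col3, col4) combination.
B = GROUP A BY (col2, col3, col4);

-- Emit every row of the group together with the group's largest col1 as col6.
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;

DUMP C;
```

Both this sketch and the ORDER ... DESC version compare col1 as a string, so the value picked is the lexicographically largest one (for example, 'key2' sorts after 'key10'). That is fine when any one value per group is acceptable, as the question states.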