Apache Pig: set the end date of the current row to the next record's date minus one day, for a given id

In Pig, I need to set end_dt to the eff_dt of the next record for the same id, minus one day, and default it to 9999-12-31 for the last record of each id.

Input data:

id     eff_dt      end_dt
1    2012-02-28   9999-12-31
1    2013-03-15   9999-12-31
1    2014-05-01   9999-12-31

Desired result (order by eff_dt, then take the next record):

id     eff_dt       end_dt
1    2012-02-28    2013-03-14
1    2013-03-15    2014-04-30
1    2014-05-01    9999-12-31

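I am new to Apache Pig. I found that lead/lag and Stitch/flatten can be used for this, but I don't know how to use them in a script to get the result above. I am facing a few issues: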

Issue 1: Pig reads the date as a chararray, so eff_dt needs to be converted into a date.
Issue 2: What is the syntax for "date minus 1 day"?
Issue 3: How do I use lead/lag to get the next record, subtract one day, and fall back to a default when there is no next record?

I got the following sample code from the Apache Pig site, but I don't know how to adapt it to my use case:

To find the record 3 ahead of the current record, using a window between the current row and 3 records ahead, with a default value of 0:

 A = load 'T';
 B = group A by si;
 C = foreach B {
     C1 = order A by i;
     generate flatten(Stitch(C1, Over(C1.i, 'lead', 0, 3, 3, 0)));
 }
 D = foreach C generate s, ;

This is equivalent to the SQL statement

select s, lead(i, 3, 0) over (partition by si order by i rows between current row and 3 following) from T;

Any help would be greatly appreciated.

You have three questions; for now I can only answer the first two.

How to convert a yyyy-MM-dd string to a date and subtract one day:

dataB = FOREACH data { 
    date = ToDate(eff_dt, 'yyyy-MM-dd');
    dayBefore = SubtractDuration(date, 'P1D');
    dayBeforeFormated = ToString(dayBefore, 'yyyy-MM-dd');

    GENERATE eff_dt, dayBeforeFormated;
}

I finally got a chance to try piggybank's Over and Stitch approach. Here is a working solution.

-- first load the piggybank and define shorthand to Over and Stitch functions
REGISTER '/data/lib/piggybank-0.12.0.jar';
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();

-- load the input data
data = LOAD '/data' USING PigStorage('\t') AS (id:int, eff_dt:chararray);

-- compute the day before each eff_dt (this could also be done later)
data_before = FOREACH data { 
    date = ToDate(eff_dt, 'yyyy-MM-dd');
    dayBefore = SubtractDuration(date, 'P1D');
    eff_before = ToString(dayBefore, 'yyyy-MM-dd');
    GENERATE id as id, eff_dt as eff_dt, eff_before as eff_before;
}

-- Stitch joins two bags tuple by tuple, based on position
-- Over applies a function over a group; here the 'lead' operator fetches the value from the next tuple
data_over = FOREACH (GROUP data_before ALL) {
    out = Stitch(data_before, Over(data_before.eff_before, 'lead', 0, 1, 1, '9999-99-99'));
    GENERATE FLATTEN(out) as (id, eff_dt, eff_before, end_dt);
}

-- finally, project the output columns (the date could also have been transformed here)
data_final = FOREACH data_over GENERATE id, eff_dt, end_dt;

The output of this script is:

(1,2012-02-28,2013-03-14)
(1,2013-03-15,2014-04-30)
(1,2014-05-01,9999-99-99)
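
Note that the script groups with ALL, so every row lands in one group in input order, and it uses 9999-99-99 as the placeholder. With several ids, the grouping would presumably need to be per id and ordered by eff_dt, as in the Pig documentation sample quoted in the question. Below is a minimal, untested sketch of that variant, assuming the same piggybank jar path and input layout as the script above. The relation names by_id, sorted, data_per_id and data_final_per_id are illustrative only.

```pig
-- same setup as the script above (jar path and input path are taken from it)
REGISTER '/data/lib/piggybank-0.12.0.jar';
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();

data = LOAD '/data' USING PigStorage('\t') AS (id:int, eff_dt:chararray);

-- compute the day before each eff_dt
data_before = FOREACH data {
    date = ToDate(eff_dt, 'yyyy-MM-dd');
    dayBefore = SubtractDuration(date, 'P1D');
    eff_before = ToString(dayBefore, 'yyyy-MM-dd');
    GENERATE id AS id, eff_dt AS eff_dt, eff_before AS eff_before;
}

-- assumption: one group per id instead of GROUP ... ALL
by_id = GROUP data_before BY id;

data_per_id = FOREACH by_id {
    -- eff_dt is a yyyy-MM-dd chararray, so string order is also date order
    sorted = ORDER data_before BY eff_dt;
    -- 'lead' with offset 1 reads eff_before from the next tuple in the group;
    -- the last tuple of each id has no successor and receives the default
    out = Stitch(sorted, Over(sorted.eff_before, 'lead', 0, 1, 1, '9999-12-31'));
    GENERATE FLATTEN(out) AS (id, eff_dt, eff_before, end_dt);
}

data_final_per_id = FOREACH data_per_id GENERATE id, eff_dt, end_dt;
```

The only changes from the script above are the GROUP ... BY id, the nested ORDER, and the 9999-12-31 default requested in the question.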