How can I "group by" using a column without displaying it?

Question

I have an input file named "students.txt" with the following structure: id, first name, last name, date of birth. Here is its content:

111111 Harry Cover 28/01/1986
222222 John Doeuf 03/01/1996
333333 Jacques Selere 18/07/1998
444444 Jean Breille 06/08/1991

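I'm trying to write a Pig script that prints all the students grouped by their month of birth. So far, I have the following user-defined function (written in Java):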

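package myudfs;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits a dd/MM/yyyy chararray into a bag holding one (id, day, month, year) tuple.
public class FormatDate extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    // Never incremented, so the first field of every output tuple is 0.
    static int id = 0;

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got " + o.getClass().getName());
            }

            Tuple t = mTupleFactory.newTuple(4);
            StringTokenizer tok = new StringTokenizer((String) o, "/", false);

            int i = 0;
            t.set(0, id);
            // Fill slots 1..3 only; the tuple has 4 fields, so index 4 would be out of range.
            while (tok.hasMoreTokens() && i < 3) {
                i++;
                t.set(i, tok.nextToken());
            }
            output.add(t);

            return output;
        } catch (ExecException ee) {
            // error handling goes here
        }
        return null;
    }
}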

My current Pig script is shown below. I'm new to this, so it is probably terrible.

REGISTER ./myudfs.jar ;
DEFINE DATE myudfs.FormatDate ;
R1 = LOAD 'students.txt' USING PigStorage('\t') 
     AS (stud_id : int, firstname : chararray, lastname : chararray, birthdate : chararray) ;
R2 = DISTINCT R1 ;
R3 = FOREACH R2 GENERATE stud_id, firstname, lastname, birthdate, FLATTEN(DATE(birthdate)) AS (id : int, day : chararray, month : chararray, year : chararray) ;
R4 = FOREACH R3 GENERATE stud_id, firstname, lastname, birthdate, month ;
R5 = GROUP R4 BY (month) ;
DUMP R5;

I can't figure out how to get rid of the "month" column without compromising the individual rows. Thanks in advance.

Answer 1

I guess you don't want to "see" the month field, but still want the data grouped by month?

Continuing from your script, use a nested FOREACH to select which fields of the grouped bag to keep:

R6 = FOREACH R5 {
    student = FOREACH R4 GENERATE stud_id, firstname, lastname, birthdate;
    GENERATE student;
}

DUMP R6;
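
If it helps to see the idea outside of Pig, here is a rough plain-Java sketch of what the GROUP plus nested FOREACH amounts to: the month is used only as the grouping key and is dropped when the groups are emitted. This is just an illustration and not Pig code. The class name GroupByMonthSketch and the helpers monthOf and groupByMonth are made up for this example, and it assumes whitespace-separated rows with dd/MM/yyyy dates like the ones in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics "group by month, then project the month away".
public class GroupByMonthSketch {

    // Returns the month part of a dd/MM/yyyy date string (assumed format).
    public static String monthOf(String birthdate) {
        return birthdate.split("/")[1];
    }

    // Groups whole rows by birth month. The month exists only as the map key;
    // it is never appended to the rows themselves.
    public static Map<String, List<String>> groupByMonth(List<String> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            // Assumed layout: id, first name, last name, birthdate
            String[] fields = row.trim().split("\\s+");
            String month = monthOf(fields[3]);
            groups.computeIfAbsent(month, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "111111 Harry Cover 28/01/1986",
            "222222 John Doeuf 03/01/1996",
            "333333 Jacques Selere 18/07/1998",
            "444444 Jean Breille 06/08/1991");

        // Emit only the grouped rows (the map values) and skip the key,
        // which is the counterpart of "GENERATE student" in the nested FOREACH.
        for (List<String> students : groupByMonth(rows).values()) {
            System.out.println(students);
        }
    }
}
```

With the four rows from the question, the two January students end up in one group, and the July and August students each get a group of their own.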

java

user-defined-functions

apache-pig