How can I "group by" using a column without displaying it?
I have an input file named "students.txt" with the following structure: id, first name, last name, date of birth.
Here is its content:
111111 Harry Cover 28/01/1986
222222 John Doeuf 03/01/1996
333333 Jacques Selere 18/07/1998
444444 Jean Breille 06/08/1991
I'm trying to write a Pig script that prints all students grouped by month of birth. So far I have the following user-defined function (written in Java):
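import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits a dd/MM/yyyy string into a one-tuple bag: (id, day, month, year).
public class FormatDate extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    static int id = 0;

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got "
                        + (o == null ? "null" : o.getClass().getName()));
            }
            Tuple t = mTupleFactory.newTuple(4);
            StringTokenizer tok = new StringTokenizer((String) o, "/", false);
            int i = 0;
            t.set(0, id);
            // Fields 1..3 hold day, month, year; stop at 3 so the index never reaches 4.
            while (tok.hasMoreTokens() && i < 3) {
                i++;
                t.set(i, tok.nextToken());
            }
            output.add(t);
            return output;
        } catch (ExecException ee) {
            // Surface the failure instead of silently returning null.
            throw new IOException("Could not split date", ee);
        }
    }
}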
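As a side note, the month can also be pulled out of a dd/MM/yyyy string with the JDK's java.time classes rather than a tokenizer. This is only a minimal sketch in plain Java, not a Pig UDF; the class name BirthMonthSketch and the method birthMonth are made up for illustration, and it assumes every birthdate follows the dd/MM/yyyy layout shown in students.txt.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class BirthMonthSketch {
    // Assumption: dates always look like 28/01/1986 (day/month/year).
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Returns the month as a number from 1 to 12; throws DateTimeParseException on bad input.
    static int birthMonth(String birthdate) {
        return LocalDate.parse(birthdate, FMT).getMonthValue();
    }

    public static void main(String[] args) {
        System.out.println(birthMonth("28/01/1986")); // prints 1
        System.out.println(birthMonth("18/07/1998")); // prints 7
    }
}
```

Unlike the tokenizer, parsing this way rejects malformed dates instead of silently producing a half-filled tuple.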
My current Pig script is shown below. I'm new to this, so it may well be terrible.
REGISTER ./myudfs.jar ;
DEFINE DATE myudfs.FormatDate ;
R1 = LOAD 'students.txt' USING PigStorage('\t')
AS (stud_id : int, firstname : chararray, lastname : chararray, birthdate : chararray) ;
R2 = DISTINCT R1 ;
R3 = FOREACH R2 GENERATE stud_id, firstname, lastname, birthdate, FLATTEN(DATE(birthdate)) AS (id : int, day : chararray, month : chararray, year : chararray) ;
R4 = FOREACH R3 GENERATE stud_id, firstname, lastname, birthdate, month ;
R5 = GROUP R4 BY (month) ;
DUMP R5;
I can't figure out how to get rid of the "month" column without compromising the grouping.
Thanks in advance.
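I guess you don't want to "see" the month field but still want the data grouped by month?
Continuing from your script, use a nested FOREACH to select only the fields you want from the bag inside each group: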
R6 = FOREACH R5 {
    student = FOREACH R4 GENERATE stud_id, firstname, lastname, birthdate;
    GENERATE student;
}
DUMP R6;
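To make it clearer what ends up in R6, here is the same idea sketched in plain Java on the four sample rows: group by month, then keep only the bags and drop the key. This is just an illustration of the logic, not how Pig runs it. The names GroupByMonthSketch, Student, monthOf and groupWithoutKey are hypothetical, and the sorted output order comes from the TreeMap used here (Pig makes no such promise for DUMP).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByMonthSketch {
    // Hypothetical record mirroring the four fields loaded into R1.
    static final class Student {
        final int studId;
        final String firstname;
        final String lastname;
        final String birthdate; // dd/MM/yyyy, as in students.txt

        Student(int studId, String firstname, String lastname, String birthdate) {
            this.studId = studId;
            this.firstname = firstname;
            this.lastname = lastname;
            this.birthdate = birthdate;
        }

        @Override
        public String toString() {
            return "(" + studId + "," + firstname + "," + lastname + "," + birthdate + ")";
        }
    }

    // Month part of a dd/MM/yyyy string, e.g. "01" for 28/01/1986.
    static String monthOf(String birthdate) {
        return birthdate.split("/")[1];
    }

    // Groups by month (like R5), then returns only the bags without the key (like R6).
    static List<List<Student>> groupWithoutKey(List<Student> students) {
        Map<String, List<Student>> byMonth = new TreeMap<>();
        for (Student s : students) {
            byMonth.computeIfAbsent(monthOf(s.birthdate), k -> new ArrayList<>()).add(s);
        }
        return new ArrayList<>(byMonth.values());
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student(111111, "Harry", "Cover", "28/01/1986"),
                new Student(222222, "John", "Doeuf", "03/01/1996"),
                new Student(333333, "Jacques", "Selere", "18/07/1998"),
                new Student(444444, "Jean", "Breille", "06/08/1991"));
        // One line per group; the month itself is never printed as a separate field.
        for (List<Student> bag : groupWithoutKey(students)) {
            System.out.println(bag);
        }
    }
}
```

The two January students land in the same bag, and the month is used only as the grouping key, which is what the nested FOREACH does by generating the projected bag and leaving out the group field.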