How do I write a group's unique key as the folder name and the bag's contents as records?
Objective: write each group's unique key as a folder name, and the contents of its bag as the records in that folder.
File : employee.txt
#JoiningDate Employee Id Employee Name
20140302 1 A
20140302 2 B
20140302 3 C
20140303 4 D
20140303 5 E
20140303 6 F
Pig script:
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
The output of this (Y) would be:
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
The objective is to have two folders in the output path:
1. outputfolder/20140302 : having three records
20140302,1,A
20140302,2,B
20140302,3,C
2. outputfolder/20140303 :
20140303,4,D
20140303,5,E
20140303,6,F
I tried:
store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
and saw the following result:
1. outputfolder/20140302/20140302-0
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
2. outputfolder/20140303/20140303-0
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
One option might be to simply flatten the values before the store command.
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
Z = FOREACH Y GENERATE FLATTEN(X);
store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
The output will be stored in the outputfolder/20140302 folder, in files whose names start like 20140302-0,000
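Since MultiStorage splits its input on the field index it is given ('0' here), the GROUP step may not be needed at all. The following is a minimal sketch of that variant, assuming the stock piggybank MultiStorage and that piggybank.jar is available at the path shown (the path is an assumption, not something from the question):

```pig
-- Path to piggybank.jar is an assumption; adjust it for your installation.
REGISTER piggybank.jar;

X = LOAD 'employee.txt' USING PigStorage('\t')
    AS (joining_date:chararray, employee_id:long, employee_name:chararray);

-- '0' is the index of the field to split on (joining_date),
-- 'none' means no compression, and ',' is the output field delimiter.
STORE X INTO 'outputfolder'
    USING org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
```

With this approach each distinct joining_date should get its own subfolder under outputfolder, holding the flat comma-separated records. Skipping the GROUP also avoids building bags that are only flattened again before the store.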