Selecting rows in Pig Latin
I have data like this:
(a,b,c)
(a,c,b)
(a,b,d)
Is there something like DISTINCT that would produce the output shown below?
(a,b,c)
(a,b,d)
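I want to ignore the order and compare only the elements.
No. Your best option is to write a UDF that takes each row, sorts its fields, and returns an ordered string, and then to apply DISTINCT to that string.
Pig script: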
REGISTER ORDER_UDF.jar;
A = LOAD 'data.txt' USING PigStorage(',') AS (a1: chararray, a2: chararray, a3: chararray);
B = FOREACH A GENERATE ORDER_UDF.ORDER(CONCAT(CONCAT(a1,a2),a3));
C = DISTINCT B;
D = FOREACH C GENERATE REPLACE($0,'(?<=.)(?=.)',','); -- Re-insert a comma between the characters of the sorted string (assumes single-character fields).
DUMP D;
UDF
import java.io.IOException;
import java.util.Arrays;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ORDER extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            // The concatenated string arrives as the first field of the input tuple.
            char[] tempArray = ((String) input.get(0)).toCharArray();
            // Sorting the characters makes the result independent of field order.
            Arrays.sort(tempArray);
            return new String(tempArray);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
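One caveat with sorting the characters of the concatenated string: it only identifies rows reliably when every field is a single character, as in the sample data. The plain-Java sketch below (no Pig dependency; the class and method names `FieldSortKey`, `charKey`, and `fieldKey` are made up for illustration) contrasts that character-level key with a key built by sorting whole fields. It assumes the fields themselves contain no commas.

```java
import java.util.Arrays;

final class FieldSortKey {
    // Character-level key, mirroring the UDF above: concatenate, then sort characters.
    static String charKey(String... fields) {
        char[] chars = String.join("", fields).toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Field-level key: sort whole fields, then join them with a delimiter.
    static String fieldKey(String... fields) {
        String[] copy = fields.clone();
        Arrays.sort(copy);
        return String.join(",", copy);
    }

    public static void main(String[] args) {
        // Two different rows collapse to the same character-level key.
        System.out.println(charKey("ab", "c", "d"));  // abcd
        System.out.println(charKey("a", "bc", "d"));  // abcd
        // The field-level key keeps them apart.
        System.out.println(fieldKey("ab", "c", "d")); // ab,c,d
        System.out.println(fieldKey("a", "bc", "d")); // a,bc,d
    }
}
```

If the real data has multi-character fields, a UDF that receives the fields separately and sorts them as units would probably be the safer design.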
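As an alternative worth considering, I suggest reading the data, converting each row into a bag, sorting the bag, and then selecting the distinct rows.
Pig script: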
inp_data = load 'input.csv' USING PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray);
req_data = FOREACH inp_data GENERATE TOBAG(field1,field2,field3) AS (b:bag{t:(token:chararray)});
sorted_data = FOREACH req_data {
    sorted = ORDER b BY token;
    GENERATE sorted AS (sorted_bag:bag{t:(token:chararray)});
}
req_data_fmt = DISTINCT(FOREACH sorted_data GENERATE BagToString(sorted_bag,','));
DUMP req_data_fmt;
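To make the intended result concrete, here is a rough plain-Java model of what the script above does to the sample rows: sort the tokens within each row, join them with commas, and keep only the distinct strings. It is only an illustration of the logic, not how Pig executes the script, and the names `OrderInsensitiveDistinct` and `distinctRows` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class OrderInsensitiveDistinct {
    // Returns one comma-joined string per distinct row, ignoring token order.
    static List<String> distinctRows(List<String[]> rows) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order
        for (String[] row : rows) {
            String[] sorted = row.clone();        // like ORDER b BY token
            Arrays.sort(sorted);
            seen.add(String.join(",", sorted));   // like BagToString(sorted_bag, ',')
        }
        return new ArrayList<>(seen);             // like DISTINCT
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"a", "b", "c"},
            new String[] {"a", "c", "b"},
            new String[] {"a", "b", "d"});
        for (String row : distinctRows(rows)) {
            System.out.println(row);
        }
        // prints:
        // a,b,c
        // a,b,d
    }
}
```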