How do I get matching values in Pig without using a UDF?
Given these as my input files:
Input 1: (File 1)
12,23,14,15,9
1,2,3,4,5
34,17,8
.
.
Input 2: (File 2)
12 Twelve
23 TwentyThree
34 ThirtyFour
.
.
My Pig script will read each line of the "Input 1" file, and I want to get the result below by looking each value up in the "Input 2" file.
Output:
Twelve,TwentyThree,Fourteen,Fifteen,Nine
One,Two,Three,Four,Five
.
.
Is this possible without a UDF? Please let me know your suggestions.
Thanks in advance!
Here is a Hive solution:
-- Load the data into Hive
CREATE TABLE file1 (
    line array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ',';
LOAD DATA INPATH '/tmp/test2/file1' OVERWRITE INTO TABLE file1;
CREATE TABLE file2 (
    name string,
    value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
LOAD DATA INPATH '/tmp/test2/file2' OVERWRITE INTO TABLE file2;
-- Explode the rows from the first table and create a newid to use for correlation
CREATE TABLE file1_exploded
AS
WITH tmp
AS
(SELECT RAND() newid, line FROM file1)
SELECT newid, item FROM tmp
LATERAL VIEW EXPLODE(line) a AS item;
-- Apply substitutions using the second table, then join the lines back together
-- (note: COLLECT_LIST does not guarantee that items keep their original order)
SELECT CONCAT_WS(',', COLLECT_LIST(value))
FROM file1_exploded
JOIN file2 ON item = name
GROUP BY newid;
This violates your 'no UDF' criterion, but the UDFs involved are built in, so I suspect it is good enough.
Query:
data1 = LOAD 'file1' AS (val:chararray);
-- The default loader splits on tabs; add USING PigStorage(' ') if file2 is space-delimited.
-- The second field is named word because desc is a reserved keyword in Pig.
data2 = LOAD 'file2' AS (num:chararray, word:chararray);
A = RANK data1; /* adds a row number, rank_data1 */
B = FOREACH A GENERATE rank_data1, FLATTEN(TOKENIZE(val, ',')) AS num;
C = RANK B; /* rank_B is used to keep the tuple elements sorted in the bag */
D = JOIN C BY num, data2 BY num;
E = FOREACH D GENERATE C::rank_data1 AS rank_1:long
                     , C::rank_B AS rank_2:long
                     , data2::word AS description;
grpd = GROUP E BY rank_1;
F = FOREACH grpd {
    sorted = ORDER E BY rank_2;
    GENERATE sorted;
};
X = FOREACH F GENERATE FLATTEN(BagToTuple(sorted.description));
DUMP X;
Output:
(Twelve,TwentyThree,Fourteen,Fifteen,Nine)
(One,Two,Three,Four,Five)
(ThirtyFour,Seventeen,Eight)
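One thing to watch: the query above uses an inner JOIN, so any number in file1 that has no entry in file2 silently disappears from its line. If that matters, a variant along the following lines may help. It is only a sketch, not a tested script. It assumes file2 is space-delimited as shown in the question (hence `PigStorage(' ')`), and the field name `word` is my own choice. Unmatched tokens are passed through unchanged rather than dropped.

```pig
-- Sketch only: same pipeline as the query above, but unmatched tokens are kept.
-- Assumption: file2 is space-delimited, as shown in the question.
data1 = LOAD 'file1' AS (val:chararray);
data2 = LOAD 'file2' USING PigStorage(' ') AS (num:chararray, word:chararray);

A = RANK data1;   -- rank_data1 identifies the source line
B = FOREACH A GENERATE rank_data1, FLATTEN(TOKENIZE(val, ',')) AS num;
C = RANK B;       -- rank_B records the order of the tokens

-- LEFT OUTER keeps the rows of C that have no partner in data2
D = JOIN C BY num LEFT OUTER, data2 BY num;

-- Fall back to the original token when the lookup found nothing
E = FOREACH D GENERATE C::rank_data1 AS rank_1,
                       C::rank_B     AS rank_2,
                       (data2::word IS NULL ? C::num : data2::word) AS description;

grpd = GROUP E BY rank_1;
F = FOREACH grpd {
    sorted = ORDER E BY rank_2;
    GENERATE sorted;
};
X = FOREACH F GENERATE FLATTEN(BagToTuple(sorted.description));
DUMP X;
```

With this version, a token such as 17 that is missing from file2 would come through as 17 in its original position instead of shortening the line.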