Pig: Unable to load data using PigStorage
I have this sample dataset in a txt file (format: firstname, lastname, age, sex):
(Eric,Ack,27,M),(Jeremy,Ross,29,F)
(Jenny,Dicken,27,F),(Vijay,Sampath,40,M)
(Angs,Dicken,28,M),(Venu,Rao,28,M)
(Mahima,Mohanty,29,F),(Kenny,Oath,28,M)
I am trying to load this data like this:
tuple_record = LOAD '~/Documents/Pig_Tuple.txt' USING PigStorage(',') AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray));
But this does not work:
DUMP tuple_record;
Running this command gives me the following (i.e. it returns nothing):
()
()
()
()
Please advise how to load this dataset.
From the complex schemas section of the Pig documentation:
cat data;
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
A = LOAD 'data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
DUMP A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
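The reason is that the tuples and the fields inside each tuple use the same delimiter (','). In this case, Pig parses the input and fails during schema conversion.
You can see the following log in your console: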
"Unable to interpret the value in field being converted to type tuple, caught ParseException <Unexpect end of tuple> field discarded"
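To solve this problem, you need to change the tuple delimiter ',' to something else. In the example below, I use '#' as the delimiter instead of ','. You can use any delimiter other than ','.
Your input file has two tuples per line, but you defined only one tuple in the load schema, so you also need to define the other one.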
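To make the delimiter clash concrete, here is a rough Python sketch. It is not Pig code: it assumes the loader simply cuts each line at every occurrence of the delimiter, and the helpers `split_like_pigstorage` and `is_complete_tuple` are hypothetical names used only for illustration.

```python
# Illustration only: PigStorage is modeled as a plain split on the delimiter.
# split_like_pigstorage and is_complete_tuple are hypothetical helpers, not Pig APIs.

def split_like_pigstorage(line, delimiter):
    """Cut the line at every occurrence of the delimiter, ignoring parentheses."""
    return line.split(delimiter)


def is_complete_tuple(text):
    """A field can only be read as a tuple if it opens and closes its parentheses."""
    return text.startswith("(") and text.endswith(")")


comma_fields = split_like_pigstorage("(Eric,Ack,27,M),(Jeremy,Ross,29,F)", ",")
hash_fields = split_like_pigstorage("(Eric,Ack,27,M)#(Jeremy,Ross,29,F)", "#")

print(comma_fields[0], is_complete_tuple(comma_fields[0]))  # (Eric False
print(hash_fields[0], is_complete_tuple(hash_fields[0]))    # (Eric,Ack,27,M) True
```

With ',' the first field is the fragment `(Eric`, which has no closing parenthesis, which is consistent with the "Unexpect end of tuple" message. With '#' each line yields exactly two complete tuples, matching a schema that declares two tuple fields.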
Example:
Input:
(Eric,Ack,27,M)#(Jeremy,Ross,29,F)
(Jenny,Dicken,27,F)#(Vijay,Sampath,40,M)
(Angs,Dicken,28,M)#(Venu,Rao,28,M)
(Mahima,Mohanty,29,F)#(Kenny,Oath,28,M)
Pig script:
tuple_record = LOAD '~/Documents/Pig_Tuple.txt' USING PigStorage('#') AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray), details1:tuple(firstname1:chararray,lastname1:chararray,age1:int,sex1:chararray));
DUMP tuple_record;
Output:
((Eric,Ack,27,M),(Jeremy,Ross,29,F))
((Jenny,Dicken,27,F),(Vijay,Sampath,40,M))
((Angs,Dicken,28,M),(Venu,Rao,28,M))
((Mahima,Mohanty,29,F),(Kenny,Oath,28,M))
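Update:
How to change the delimiter ',' to something else
Option 1: Using sed
This is a very simple option: use the sed command to replace the pattern '),(' with ')#(' so that the delimiter between the tuples in the input file changes from ',' to '#'. (Note: sed -i edits the file in place, so back up your input file before running it.)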
>> sed -i -- 's/),(/)#(/g' inputFile
Option 2: A slight modification to the Pig script, without changing the delimiter
Pig script:
--Read each input line as chararray
A = LOAD 'inputFile' AS (line:chararray);
--Remove the character '(',')' from the input
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[)(]+','')) AS (newline:chararray);
--Split the input using ',' as the delimiter; 8 is the total number of fields
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',8)) AS (firstname1:chararray,lastname1:chararray,age1:int,sex1:chararray,firstname2:chararray,lastname2:chararray,age2:int,sex2:chararray);
--Group the fields and form tuples
D = FOREACH C GENERATE TOTUPLE(firstname1,lastname1,age1,sex1) AS details1,TOTUPLE(firstname2,lastname2,age2,sex2) AS details2;
--Now you can do whatever you want.
E = FOREACH D GENERATE details1.firstname1,details2.firstname2;
DUMP E;
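For readers who want to check what each relation in Option 2 should contain, the same steps can be traced in plain Python on one input line. This is only an illustrative sketch of the string handling, not a replacement for the Pig script; the variable names are made up for the example.

```python
import re

# Plain-Python walk-through of the relations A to E above; illustrative only.
line = "(Eric,Ack,27,M),(Jeremy,Ross,29,F)"          # A: the raw line

newline = re.sub(r"[)(]+", "", line)                 # B: parentheses removed
parts = newline.split(",", 7)                        # C: at most 8 fields

# C also casts the two age fields to int, as the AS clause does.
first = (parts[0], parts[1], int(parts[2]), parts[3])
second = (parts[4], parts[5], int(parts[6]), parts[7])

details = (first, second)                            # D: two tuples per record
projected = (details[0][0], details[1][0])           # E: both first names

print(newline)    # Eric,Ack,27,M,Jeremy,Ross,29,F
print(projected)  # ('Eric', 'Jeremy')
```

Note that Python's `split(",", 7)` performs at most 7 cuts and so returns at most 8 pieces, which corresponds to the limit of 8 passed to STRSPLIT.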