Pig: how to format a semi-structured CSV with filters
I have a semi-structured CSV that looks like this:
VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61
VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++
VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE
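I want to build three different tables from this data: one with the VTS,01 records, one with VTS,99, and one with VTS,66. I also need to remove the "+++" that is appended to some lines in error. For that I wrote this Pig script: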
data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage('\n') as (f1:chararray);
splt = foreach data generate FLATTEN(STRSPLIT($0, '\+++'));
data_pkt = FILTER splt BY $0 MATCHES '.*VTS,01+.*';
sos_pkt = FILTER splt BY $0 MATCHES '.*VTS,99+.*';
health_pkt = FILTER splt BY $0 MATCHES '.*VTS,66+.*';
When I test this script for each table separately, I get output for only one of them and nothing for the rest:
dump data_pkt;
dump sos_pkt;
dump health_pkt;
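I am new to Pig, so could anyone help me solve this? I would appreciate it.
This will filter your records based on the value of the second field.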
a = load '/abc.txt' using PigStorage(',');
b1 = FILTER a by $1==01;
b66 = FILTER a by $1==66;
b99 = FILTER a by $1==99;
To remove the +++, you have to write a simple Pig UDF.
Output:
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++)
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE)
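The UDF mentioned above could look roughly like the following. This is a minimal sketch, assuming it would be registered through Pig's Jython support; the function name strip_plus is hypothetical, and the stand-in outputSchema decorator is only there so the file can also be run as plain Python.

```python
import re

try:
    outputSchema  # assumed to be predefined when Pig loads this file via Jython
except NameError:
    def outputSchema(schema):
        # Stand-in decorator so the file can also be run outside Pig.
        def wrap(func):
            return func
        return wrap

# One or more '+' characters at the very end of the record.
TRAILING_PLUS = re.compile(r'[+]+$')


@outputSchema("line:chararray")
def strip_plus(line):
    # Null fields arrive as None; hand them back unchanged.
    if line is None:
        return None
    return TRAILING_PLUS.sub('', line)


print(strip_plus("VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++"))
```

In a Pig script the file would then be registered with something like REGISTER 'strip_plus.py' USING jython AS cleanup; and called as cleanup.strip_plus($0), where both the file name and the alias are placeholders.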
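To remove the +++, you also need to escape every "+", not just the first one.
You were not very specific about what these plus signs mean. You would do better to split on this regular expression: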
"\+{3,}"
So, in your Pig script:
splt = foreach data generate FLATTEN(STRSPLIT($0, '\+{3,}'));
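To see what that pattern matches, here is a small check in Python using three lines from the question. It is only an illustration of the regular expression itself, not of Pig's STRSPLIT; the helper name first_piece is made up for the example.

```python
import re

# Three or more consecutive '+' characters.
PLUS_RUN = re.compile(r'\+{3,}')

samples = [
    "VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++",
    "VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++",
    "VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE",
]


def first_piece(line):
    # Splitting on the plus run leaves the record itself in the first piece.
    return PLUS_RUN.split(line)[0]


cleaned = [first_piece(s) for s in samples]
for row in cleaned:
    print(row)
```

The same pattern covers both the "+++" and the "++++++" suffix, and a line without any plus signs comes back unchanged.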
Although Aman is correct, I would rather use SPLIT than FILTER to separate the datasets:
a = load '/abc.txt' using PigStorage(',');
SPLIT a INTO
b01 IF $1 == 01,
b66 IF $1 == 66,
b99 IF $1 == 99;
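As a way to check what the three relations should contain, the same routing can be sketched in plain Python over the sample lines from the question. This is not Pig code; the function name route and the bucket layout are assumptions made for the illustration.

```python
import re

# One or more '+' characters at the end of a record.
TRAILING_PLUS = re.compile(r'[+]+$')

SAMPLE = """
VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61
VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++
VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE
""".strip().splitlines()


def route(lines):
    # One pass over the input: clean each record, then file it by its second field.
    buckets = {"01": [], "66": [], "99": []}
    for line in lines:
        fields = TRAILING_PLUS.sub("", line.strip()).split(",")
        if len(fields) > 1 and fields[1] in buckets:
            buckets[fields[1]].append(fields)
    return buckets


result = route(SAMPLE)
for packet_type in sorted(result):
    print(packet_type, len(result[packet_type]))
```

Each record is read once and lands in exactly one bucket, which is the behaviour the SPLIT statement is meant to give.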
It works fine now:
data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage(',');
splt = foreach data generate $0 as col0:chararray, $1 as col1:chararray, $2 as col2:chararray, $3 as col3:chararray, $4 as col4:chararray, $5 as col5:chararray, $6 as col6:chararray, $7 as col7:chararray, $8 as col8:chararray, $9 as col9:chararray, $10 as col10:chararray, $11 as col11:chararray, $12 as col12:chararray,, FLATTEN(STRSPLIT(, '\+++'));
data_pkt = FILTER splt BY $1 MATCHES '.*01+.*';
health_pkt = FILTER splt BY $1 MATCHES '.*66+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*99+.*';
But the problem is that it takes three steps.