Replace a comma (,) only if it is inside quotes ("") in Pig

I have data like this:

1,234,"john, lee", john@xyz.com

I want to use a Pig script to remove the comma that sits inside the double quotes (the space after it stays), so that my data looks like this:

1,234,john lee, john@xyz.com

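I tried loading this data with CSVExcelStorage, but I also need the "-tagFile" option, which CSVExcelStorage does not support. So I plan to use only PigStorage and then replace any comma (,) that appears inside quotes. I am stuck on this. Any help is much appreciated. Thanks.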

The following commands should help:

csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray), $1 as (field2:chararray), CONCAT(REPLACE($2, '\"', ''), REPLACE($3, '\"', '')) as field3, $4 as (field4:chararray);

Output:

(1,234,john lee, john@xyz.com)
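Why this works: PigStorage(',') splits on every comma, so the quoted name arrives as two separate fields that still carry one quote each. The following is a minimal Python sketch of the same idea, not Pig code; the function name `merge_split_name` is hypothetical, and it assumes the quoted value is always the third column and contains exactly one comma.

```python
def merge_split_name(line: str) -> list:
    """Imitate a plain comma split, then glue the two halves of the quoted name."""
    fields = line.split(',')  # the quoted name is cut into two pieces here
    # Assumption: the pieces are at positions 2 and 3, as in the sample record.
    name = fields[2].replace('"', '') + fields[3].replace('"', '')
    return [fields[0], fields[1], name, fields[4]]


print(merge_split_name('1,234,"john, lee", john@xyz.com'))
# ['1', '234', 'john lee', ' john@xyz.com']
```

Because the column positions are hard-coded, this approach breaks as soon as a quoted comma shows up in a different column or a value contains more than one comma.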

Load each line into a single field, then use STRSPLIT and REPLACE:

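A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
-- Split on the double quotes into at most three parts: before, inside, and after the quotes
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\"',3)) AS (pre:chararray, quoted:chararray, post:chararray);
-- Remove the comma from the quoted part only
C = FOREACH B GENERATE pre, REPLACE(quoted,',','') AS quoted, post;
D = FOREACH C GENERATE CONCAT(CONCAT(pre,quoted),post) AS line; -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(line,',',4);
DUMP E;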

A

1,234,"john, lee", john@xyz.com

B

(1,234,)(john, lee)(, john@xyz.com)

C

(1,234,)(john lee)(, john@xyz.com)

D

(1,234,john lee, john@xyz.com)

E

(1),(234),(john lee),(john@xyz.com)
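The STRSPLIT-and-REPLACE steps above can be traced outside Pig. This is a minimal Python sketch of the same sequence, with a hypothetical function name `strip_quoted_comma`; it assumes, as the answer does, that a line holds at most one quoted field.

```python
def strip_quoted_comma(line: str) -> str:
    """Split on the double quotes, clean the middle part, and join again."""
    parts = line.split('"', 2)  # at most three parts, like a split limit of 3
    if len(parts) < 3:
        return line  # no complete quoted field, leave the line unchanged
    before, quoted, after = parts
    return before + quoted.replace(',', '') + after


cleaned = strip_quoted_comma('1,234,"john, lee", john@xyz.com')
print(cleaned)                # 1,234,john lee, john@xyz.com
print(cleaned.split(',', 3))  # ['1', '234', 'john lee', ' john@xyz.com']
```

With a limit of three parts, only the first quoted field is cleaned. A second quoted field on the same line ends up in the last part and keeps both its quotes and its commas.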

I found a clean way to do this. A fairly generic solution is as follows:

data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);

/*replace comma(,) if it appears in column content*/
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\"]*\"){2})*[^\"]*$)', '');

/*remove the quotes("") that CSV places around a column when its content contains a comma(,)*/
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($1,'"','') as record;
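The regular expression in the REPLACE call is the core of this answer. It matches a comma only when the rest of the line does not contain an even number of double quotes, which means the comma sits inside a quoted field. The pattern uses only constructs that Python's `re` module also supports, so it can be checked quickly with the small sketch below; the function name `clean_record` is hypothetical.

```python
import re

# A comma that is NOT followed by an even number of double quotes up to the
# end of the line, i.e. a comma inside a quoted field.
COMMA_IN_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')


def clean_record(record: str) -> str:
    """Drop commas inside quoted fields, then drop the quotes themselves."""
    without_inner_commas = COMMA_IN_QUOTES.sub('', record)
    return without_inner_commas.replace('"', '')


print(clean_record('1,234,"john, lee", john@xyz.com'))
# 1,234,john lee, john@xyz.com
```

Unlike the positional approaches, this handles any number of quoted fields per line and any number of commas inside each one, provided the quotes are balanced.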

A detailed use case is available on my blog.