Convert text file to avro using Pig script
I am converting a text file to Avro using a Pig script.
I have a pipe-delimited text file at /user/hduser/pig_input/abc.dat:
1|8|123|985|659856|10000000002546
1|8|123|985|659856|10000000002546
1|8|123|985|659856|10000000002546
1|8|123|985|659856|10000000002546
1|8|123|985|659856|10000000002546
The schema file is on HDFS at /user/hduser/pig_schema_files/abc.avsc:
{
  "type" : "record",
  "name" : "import_dummy",
  "doc" : "import_123dummy",
  "fields" : [ {
    "name" : "ID",
    "type" : [ "string", "null" ],
    "columnName" : "ID",
    "sqlType" : "3"
  }, {
    "name" : "TRANS_O",
    "type" : [ "string", "null" ],
    "columnName" : "TRANS_O",
    "sqlType" : "3"
  }, {
    "name" : "CARD_O",
    "type" : [ "string", "null" ],
    "columnName" : "CARD_O",
    "sqlType" : "3"
  }, {
    "name" : "SEQ_O",
    "type" : [ "string", "null" ],
    "columnName" : "SEQ_O",
    "sqlType" : "1"
  }, {
    "name" : "DATE_O",
    "type" : [ "string", "null" ],
    "columnName" : "DATE_O",
    "sqlType" : "3"
  } ],
  "tableName" : "123dummy"
}
Below is the script I wrote:
REGISTER /app/cloudera/parcels/CDH/lib/pig/piggybank.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/avro-1.3.7.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-core-asl.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-mapper-asl.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/json-simple.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/snappy-java.jar
textfile = LOAD '/user/hduser/pig_input/abc.dat' USING PigStorage('|');
STORE textfile INTO '/user/hduser/pig_output/'
USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema_file','/user/hduser/pig_schema_files/abc.avsc');
I get the following error after running the script:
2015-02-03 09:46:56,369 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000:<file script.pig, line 9, column 0>
Output Location Validation Failed for: '/user/hduser/pig_output/
More info to follow:
Output schema is null!
Along with loading the delimited file, you have to declare the field names (with an AS clause) so that the relation carries a schema; AvroStorage then maps those field names to the Avro record fields automatically when it writes the output. Without the AS clause the relation's schema is null, which is exactly what the "Output schema is null!" error is complaining about.
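A quick way to confirm this in the Grunt shell is DESCRIBE (a minimal check, assuming the relation is named textfile as in the question):

-- with no AS clause, Pig reports the schema for textfile as unknown
DESCRIBE textfile;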
textfile = LOAD '/user/hduser/pig_input/abc.dat' USING PigStorage('|') AS (ID:chararray, TRANS_O:chararray, CARD_O:chararray, SEQ_O:chararray, DATE_O:chararray);
STORE textfile INTO '/user/hduser/pig_output/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
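An alternative, if you want the output to match the abc.avsc definition rather than the schema Pig infers from the AS clause, is to hand the Avro schema to AvroStorage directly. A minimal sketch, assuming the piggybank AvroStorage in use accepts the 'schema' parameter with an inline JSON string (the JSON below is abc.avsc trimmed to the core fields):

textfile = LOAD '/user/hduser/pig_input/abc.dat' USING PigStorage('|') AS (ID:chararray, TRANS_O:chararray, CARD_O:chararray, SEQ_O:chararray, DATE_O:chararray);
-- pass the target record schema inline; AvroStorage then writes exactly this schema
-- if it complains about the ["string","null"] unions, the 'no_schema_check' option may also be needed
STORE textfile INTO '/user/hduser/pig_output/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema', '{"type":"record","name":"import_dummy","fields":[{"name":"ID","type":["string","null"]},{"name":"TRANS_O","type":["string","null"]},{"name":"CARD_O","type":["string","null"]},{"name":"SEQ_O","type":["string","null"]},{"name":"DATE_O","type":["string","null"]}]}');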