Can Pig be used to LOAD from a Parquet table in HDFS with partitions, and add the partitions as columns?
I have a partitioned Impala table stored as Parquet. Can I use Pig to load data from this table and add the partitions as columns?
The Parquet table is defined as:
create table test.test_pig (
  name string,
  id bigint
)
partitioned by (gender string, age int)
stored as parquet;
The Pig script is as follows:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A; only name and id are shown.
I tried:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
but then I get an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable
schema: left is "name:bytearray,id:long,gender:bytearray,age:int",
right is "name:bytearray,id:long"
Any advice here would be appreciated. Thanks!
You should try the org.apache.hcatalog.pig.HCatLoader library instead.
Normally, Pig supports reading from and writing to partitioned tables.
Reading:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
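As a hedged sketch of the pattern the docs describe, applied to your table (assuming test.test_pig is registered in the Hive metastore, and using illustrative filter values):

```pig
/* Load through HCatalog, so partition columns (gender, age)
   appear in the schema alongside name and id. */
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();

/* Partition filter placed immediately after the load, so HCatalog
   can prune partitions; it may also mix in non-partition columns. */
B = FILTER A BY gender == 'M' AND age > 30;
DUMP B;
```

Note that the load statement names the table, not the HDFS path, since HCatLoader resolves locations through the metastore.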
Writing:
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
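A hedged sketch of the dynamic-partitioning write path (output table name is assumed; as noted below, this is not verified with Parquet):

```pig
/* Input relation assumed to carry the partition columns gender and age.
   With no partition key/value pairs passed to HCatStorer, dynamic
   partitioning is triggered and the data is inspected to route each
   record to the right partition. */
STORE B INTO 'test.test_pig_out' USING org.apache.hcatalog.pig.HCatStorer();
```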
However, I don't think this has been properly tested with Parquet files yet (at least not by the Cloudera folks):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html