Can Pig be used to LOAD from a partitioned Parquet table in HDFS and add the partitions as columns?

Question

I have an Impala partitioned table, stored as Parquet. Can I use Pig to load data from this table and add the partitions as columns?

The Parquet table is defined as:

create table test.test_pig (
    name string,
    id bigint
)
partitioned by (gender string, age int)
stored as parquet;

The Pig script is:

A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);

However, gender and age are missing when I DUMP A. Only name and id are displayed.

I have tried:

A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);

But I get an error like this:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "name:bytearray,id:long,gender:bytearray,age:int", right is "name:bytearray,id:long"

I hope to get some advice here. Thank you!

Answer 1

You should test with the org.apache.hcatalog.pig.HCatLoader library.

Normally, Pig supports reading from and writing to partitioned tables through HCatalog.

Reading:

This load statement will load all partitions of the specified table.

/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...

If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.

https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
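Applied to the table from the question, the load-then-filter pattern described in that quote might look roughly like the sketch below. This is untested against Parquet; the filter values 'F' and 30 are made up for illustration, and it assumes the Impala table is visible through the Hive metastore that HCatalog reads.

```pig
/* load_partitions.pig (hypothetical file name) */

-- No AS clause: HCatLoader takes the schema from the metastore,
-- so the partition keys (gender, age) should show up as ordinary columns.
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();

-- Partition filter placed right after the load, as the quoted docs recommend.
-- 'F' and 30 are placeholder values, not taken from the question.
B = FILTER A BY gender == 'F' AND age >= 30;

DESCRIBE B;
DUMP B;
```

The script would be started with pig -useHCatalog load_partitions.pig so that the HCatalog jars are on the classpath. In newer Hive releases the loader class lives under org.apache.hive.hcatalog.pig.HCatLoader instead.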

Writing:

HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.

https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions

However, I don't think this has been properly tested with Parquet files yet (at least not by the Cloudera folks):

Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html
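If HCatalog turns out not to work with Parquet in your environment, one possible fallback is to load a single partition directory with ParquetLoader and append the partition values as constants. This is only a sketch: it assumes the usual key=value directory layout under the table path, and the partition values 'F', 'M' and 30 are invented for illustration.

```pig
-- Assumption: partitions are stored as <table path>/gender=<value>/age=<value>.
F30 = LOAD '/test/test_pig/gender=F/age=30'
      USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);

-- The partition values are not stored in the Parquet files themselves,
-- so add them back as constant columns.
F30_cols = FOREACH F30 GENERATE name, id, 'F' AS gender, 30 AS age;

M30 = LOAD '/test/test_pig/gender=M/age=30'
      USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
M30_cols = FOREACH M30 GENERATE name, id, 'M' AS gender, 30 AS age;

-- Combine the per-partition relations into one.
ALL_ROWS = UNION F30_cols, M30_cols;
DUMP ALL_ROWS;
```

This needs one LOAD per partition, so it is only practical when the set of partitions is small and known in advance.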

apache-pig

database-partitioning

hdfs

parquet