Create Hive external table from partitioned Parquet files in Azure HDInsight

Question

I have data stored as Parquet files in Azure Blob storage. The data is partitioned by year, month, day, and hour, like this:

cont/data/year=2017/month=02/day=01/

I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference:

CREATE EXTERNAL TABLE table_name (uid string, title string, value string) 
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET 
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';

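This creates the table, but queries against it return no rows. I tried the same CREATE statement without the PARTITIONED BY clause, and that seems to work, so the problem appears to be with the partitioning.

What am I missing?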

Answer 1

After creating the partitioned table, run the following command to register the directories as partitions:

MSCK REPAIR TABLE table_name;

If you have a large number of partitions, you may need to set hive.msck.repair.batch.size. From the Hive documentation:

When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

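This may solve your problem, but it will not if the data is very large. See the related issue here.

As a workaround, you can instead add the partitions to the Hive Metastore one at a time, for example: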

alter table table_name add partition(year=2016, month=10, day=11, hour=11)

We wrote a simple script to automate this ALTER statement, and so far it seems to work fine.
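The script itself is not shown in this answer, so the following is only a minimal sketch of what such automation could look like, not the actual script. It generates one ALTER TABLE statement per hour in a time range. The function name add_partition_statements is hypothetical. It also assumes the table declares hour as a fourth partition column (the CREATE statement in the question declares only year, month, and day), and that the files sit under an hour=HH directory below the day level, which the example path in the question does not show.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, base_location, start, end):
    """Return one ALTER TABLE ... ADD PARTITION statement per hour in [start, end)."""
    statements = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        # Partition columns are declared as int, so the values are unquoted and unpadded.
        spec = (f"year={current.year}, month={current.month}, "
                f"day={current.day}, hour={current.hour}")
        # Directory names are zero-padded, as in the question's path (month=02).
        # The hour=HH level is an assumption; the example path stops at day.
        location = (f"{base_location}/year={current:%Y}/month={current:%m}/"
                    f"day={current:%d}/hour={current:%H}")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
            f"LOCATION '{location}';"
        )
        current += timedelta(hours=1)
    return statements


if __name__ == "__main__":
    base = "wasb://cont@storage_name.blob.core.windows.net/data"
    for stmt in add_partition_statements(
        "table_name", base, datetime(2016, 10, 11, 11), datetime(2016, 10, 11, 13)
    ):
        print(stmt)
```

The printed statements can be saved to a file and run with hive -f or beeline -f. IF NOT EXISTS makes the statements safe to re-run, and the explicit LOCATION avoids relying on Hive to derive the zero-padded directory name from the integer partition values.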

hive

azure

parquet

azure-hdinsight