Hadoop partitioning. How do you efficiently design a Hive/Impala table?

Question

Given the following facts, how would you efficiently design a Hive/Impala table?

Answer 1

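If you are running analytics on this data, Parquet is a solid choice of format for Impala. What has worked well for our users is to partition the data by year, month, and day, based on a date value in each record.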

So, for example: CREATE TABLE foo (tool_id INT, eff_dt TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET;

When loading data into this table, we use something like the following to create the partitions dynamically:

INSERT INTO foo partition (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
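
If the load runs in Hive rather than Impala, dynamic partitioning may need to be allowed for the session first. The following is a minimal sketch, assuming Hive's strict dynamic-partition mode is in effect; foo and source_table are the placeholder names used above.

```sql
-- Hive only: allow every partition column to be derived from the SELECT list
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause
INSERT INTO TABLE foo PARTITION (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;

-- Check which partitions were created
SHOW PARTITIONS foo;
```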

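Then you train your users: if they want the best performance, they should add YEAR, MONTH, and DAY to their WHERE clause so that the query reads only the matching partitions instead of scanning the whole table. Have them also include eff_dt in the SELECT list so that the final result shows the date value in the format they prefer.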
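
A query that follows this advice could look like the sketch below. The date values are made up for illustration, and foo is the example table from above.

```sql
-- Filtering on the partition columns lets the engine skip every
-- partition except the one for 15 March 2016
SELECT tool_id, eff_dt
FROM foo
WHERE year = 2016
  AND month = 3
  AND day = 15;

-- By contrast, filtering only on eff_dt does not reference the
-- partition columns, so the partitions cannot be pruned this way
SELECT tool_id, eff_dt
FROM foo
WHERE eff_dt >= '2016-03-15' AND eff_dt < '2016-03-16';
```

Running EXPLAIN on each query is one way to compare how many partitions are actually scanned.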

In CDH, Parquet data is stored in 256 MB blocks by default (this is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html
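
As a rough sketch of what that configuration looks like in an Impala session, the PARQUET_FILE_SIZE query option can be set before the INSERT. The 128 MB value here is only an example, not a recommendation.

```sql
-- Impala only: target size, in bytes, of the Parquet files written
-- by the following INSERT (134217728 bytes = 128 MB)
SET PARQUET_FILE_SIZE=134217728;

INSERT INTO foo PARTITION (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
```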
