Difference between partition and index in Hive

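I am new to Hadoop and Hive, and I would like to know what the difference is between an index and a partition in Hive. When should I use an index, and when should I partition?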

Thanks!

Sonia,

Here is a section from the book that may be useful to you.

"Hive has limited indexing capabilities. There are no keys in the usual relational database sense, but you can build an index on columns to speed some operations. The index data for a table is stored in another table. Also, the feature is relatively new, so it doesn’t have a lot of options yet. However, the indexing process is designed to be customizable with plug-in Java code, so teams can extend the feature to meet their needs. Indexing is also a good alternative to partitioning when the logical partitions would actually be too numerous and small to be useful. Indexing can aid in pruning some blocks from a table as input for a MapReduce job. Not all queries can benefit from an index—the EXPLAIN syntax and Hive can be used to determine if a given query is aided by an index. Indexes in Hive, like those in relational databases, need to be evaluated carefully.

Maintaining an index requires extra disk space and building an index has a processing cost. The user must weigh these costs against the benefits they offer when querying a table."

Programming Hive, page 117

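Indexes are new and still evolving (features are being added), but currently they are limited to single tables and cannot be used with external tables. Creating an index creates a separate table. Indexes can be partitioned (matching the partitions of the base table). Indexes are used to speed up searching for data within tables.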
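As a rough illustration of the points above, here is a minimal HiveQL sketch of creating and rebuilding an index. The table `orders`, its columns, and the index name `orders_customer_idx` are hypothetical names chosen for this example. The syntax assumes a Hive release that still ships index support; as far as I know, indexing was removed in Hive 3.0.

```sql
-- Hypothetical managed (non-external) table used only for this example.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
);

-- Build a compact index on customer_id. The index data is stored in a
-- separate table that Hive creates behind the scenes.
CREATE INDEX orders_customer_idx
ON TABLE orders (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD the index starts out empty; populate it explicitly.
ALTER INDEX orders_customer_idx ON orders REBUILD;

-- List the indexes defined on the table.
SHOW FORMATTED INDEX ON orders;
```

Running `EXPLAIN` on a query that filters on `customer_id` is one way to check whether the index is actually being used.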

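Partitions provide data segregation at the HDFS level by creating a subdirectory for each partition. Partitioning lets a query limit the number of files read and the amount of data searched. For this to happen, however, the partition column must be specified in the WHERE clause.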
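A minimal HiveQL sketch of this behavior, assuming a hypothetical table `page_views` partitioned by a `view_date` column (all names here are made up for illustration):

```sql
-- Hypothetical table; view_date is the partition column, so it is declared
-- in PARTITIONED BY rather than in the regular column list.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- Each partition becomes its own subdirectory under the table's HDFS
-- location, for example .../page_views/view_date=2020-01-01/
ALTER TABLE page_views ADD PARTITION (view_date = '2020-01-01');
ALTER TABLE page_views ADD PARTITION (view_date = '2020-01-02');

-- The filter on view_date lets Hive read only the matching subdirectory.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2020-01-01'
GROUP BY url;

-- No filter on the partition column: every partition is scanned.
SELECT COUNT(*) FROM page_views WHERE user_id = 42;
```

The last query is the case where an index on `user_id` might help, since partitioning by `user_id` would likely produce too many small partitions.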

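When building your data model, you can determine the best use of indexing and/or partitioning based on the size of your data and the expected usage patterns.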