Hive (Big Data) - difference between bucketing and indexing

What is the main difference between bucketing and indexing of a table in Hive?

The main difference is the goal:

Indexing

The goal of Hive indexing is to speed up query lookups on specific columns of a table. Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' loads the entire table or partition and processes every row. If an index exists on col1, only a portion of the file needs to be loaded and processed.
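
To make the idea concrete, here is a minimal Python sketch of the difference between a full scan and an indexed lookup. It is a toy in-memory model, not Hive's implementation: the names `rows`, `build_index`, and `indexed_lookup` are hypothetical, and a real Hive index points at file offsets rather than list positions. Note also that index support was removed in Hive 3.0, so this applies to earlier versions.

```python
from collections import defaultdict

# Toy "table": each row is a dict; col1 is the column we filter on.
rows = [{"col1": i % 5, "payload": f"row-{i}"} for i in range(20)]


def full_scan(table, value):
    """No index: inspect every row, like WHERE col1 = value on a plain table."""
    return [row for row in table if row["col1"] == value]


def build_index(table, column):
    """Map each distinct column value to the positions of the matching rows."""
    index = defaultdict(list)
    for pos, row in enumerate(table):
        index[row[column]].append(pos)
    return index


def indexed_lookup(table, index, value):
    """With an index: touch only the rows the index points at."""
    return [table[pos] for pos in index.get(value, [])]


col1_index = build_index(rows, "col1")

# Both approaches return the same rows; the indexed one reads far fewer.
assert indexed_lookup(rows, col1_index, 3) == full_scan(rows, 3)
```

The index costs extra storage and has to be rebuilt when the data changes, which is the usual trade-off for faster lookups.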

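Indexing becomes more important as tables grow very large, and as you no doubt know by now, Hive thrives on large tables.

Bucketing

Bucketing is typically used for join operations, because storing records by a specific 'key' or 'id' lets you optimize the join. When you run the join, records with the same 'key' are in the same bucket, so the join is faster. You can think of it as a technique for breaking a data set into more manageable parts. This link gives you 5 tips for efficient Hive queries, one of which is about bucketing.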
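
The following Python sketch illustrates why bucketing helps joins. It is a simplified model under stated assumptions: keys are non-negative integers, the bucket is chosen as `key % NUM_BUCKETS`, and `bucketize` and `bucketed_join` are hypothetical helper names. Both tables must be bucketed on the join key with the same bucket count for the buckets to line up.

```python
NUM_BUCKETS = 4  # hypothetical bucket count, the same for both tables


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign a non-negative integer key to a bucket (assumed hash: key mod n)."""
    return key % num_buckets


def bucketize(rows, key_col):
    """Split rows into NUM_BUCKETS lists according to the key column."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_col])].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key_col):
    """Join bucket i of the left table only against bucket i of the right table."""
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Build a small lookup for this bucket only, not for the whole table.
        lookup = {}
        for right_row in right_bucket:
            lookup.setdefault(right_row[key_col], []).append(right_row)
        for left_row in left_bucket:
            for right_row in lookup.get(left_row[key_col], []):
                joined.append({**left_row, **right_row})
    return joined


users = [{"id": i, "name": f"user-{i}"} for i in range(8)]
orders = [{"id": i % 8, "amount": 10 * i} for i in range(12)]

joined = bucketed_join(bucketize(users, "id"), bucketize(orders, "id"), "id")
```

Because matching keys always land in the same bucket number, each pair of buckets can be joined independently, which is what lets the work be split into smaller pieces.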

hadoop

hive

mapreduce

bigdata