Why should the number of buckets in Hive equal the number of reducers?

Question

In Hive, why should the number of buckets be equal to the number of reducers?

Answer 1

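Because that is the most efficient way for MapReduce to work, all else being equal. With one reducer per bucket, each bucket can be written by its own reducer, so the work is spread across the reducers.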

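In Hive 0.x and 1.x you have to set this explicitly: hive.enforce.bucketing = true. With that setting, the number of reducers is determined automatically from the number of buckets in the table. In later versions of Hive (2.x), this behavior is the default.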

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

Answer 2

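The number of reducers launched when inserting into a bucketed table is a divisor of the number of buckets in the table. Hive picks the largest divisor that does not exceed the configured maximum number of reducers and launches that many reducers.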

Example:

Number of buckets in the table: 5956
hive.exec.reducers.max = 1009
Factorization: 5956 = 1489 * 4
Number of launched reducers: 4

So either 1489 or 4 reducers could be launched, but because the maximum number of reducers is 1009, only 4 reducers run. On a large table that can take forever.

Setting hive.exec.reducers.max=2000 launches 1489 reducers.
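
The arithmetic in this answer can be checked with a short Python sketch. It only mirrors the rule as described above (largest divisor of the bucket count that does not exceed hive.exec.reducers.max); it is not taken from Hive's source code, and the function name reducers_for_bucketed_insert is made up for illustration.

```python
def reducers_for_bucketed_insert(num_buckets: int, max_reducers: int) -> int:
    """Return the largest divisor of num_buckets that is <= max_reducers.

    Hypothetical helper: it models the rule described in the answer,
    not Hive's actual implementation.
    """
    if num_buckets < 1 or max_reducers < 1:
        raise ValueError("num_buckets and max_reducers must be positive")
    # Walk down from the cap until we hit a divisor of the bucket count.
    # The loop always terminates at 1, which divides every integer.
    for candidate in range(min(num_buckets, max_reducers), 0, -1):
        if num_buckets % candidate == 0:
            return candidate
    return 1  # unreachable, kept for clarity


if __name__ == "__main__":
    # Values from the example: 5956 buckets, two different caps.
    print(reducers_for_bucketed_insert(5956, 1009))  # 4
    print(reducers_for_bucketed_insert(5956, 2000))  # 1489
```

Because 1489 is prime, the divisors of 5956 are 1, 2, 4, 1489, 2978 and 5956, which is why a cap of 1009 falls all the way back to 4 reducers.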

apache

hadoop

hive

buckets

partitioning