Hive: Inner Join query executing forever due to last Reducer job

Question

I am executing an INNER JOIN with Hive 1.2.1000.2 on Azure HDInsight 3.6 to count the records that appear in both Table_1 and Table_2.

Table details:

Table_1: 310M records

Sample data:

master_id        modelkey     order_id  
---------------------------------------
mi0000bd1444     4874         d988e53cd
mi000097d5       44365        p0905gd44
mi0000d2ab09ea   309141         
mi0001d6a        8705         7574  
mi00011f7c085    4063         d165804b2
mi0001a57db      314          9c84ft879

Table_2: 35M records

Sample data:

order_id    vendor_id
---------------------------------------
81d162f23   7122a0c
6988e53cd   517ba6e
5165804b2   5c5e161
47ba91ea3   7686b2d
f45cab9de   35be1af

Below are the details of what I have tried so far.

Hive query:

SELECT COUNT(*) 
FROM db.table_1 t1
INNER JOIN db.table_2 t2 ON t1.order_id = t2.order_id;

Hive properties:

SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.execution.engine=tez;
SET hive.exec.compress.output=true;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

The query has been running for more than 7 hours and is stuck on the last Reducer job:

--------------------------------------------------------------------------------
    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 4 ..........   SUCCEEDED    715        715        0        0       0       0
Reducer 2 .....      RUNNING    189        188        1        0       0       0
Reducer 3            RUNNING      1          0        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/04  [=========================>>-] 99%   ELAPSED TIME: 25307.97 s  
--------------------------------------------------------------------------------

Is there a way to get past the last Reducer and obtain the result?


Answer 1

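I performed the following steps, which helped. I hope they help others as well:

Removed the records that have no value, i.e. order_id=''
Ran the JOIN in batches instead of all at once
Referred to the following to set some of the Hive properties: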

hive properties
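
The first two steps above can be sketched in HiveQL roughly as follows. This is only an illustration under stated assumptions: the answer does not say how the batches were formed, so splitting on a hash bucket of order_id and the batch count of 4 are hypothetical choices, and the skew check in step 0 is an extra diagnostic that is not part of the original answer.

```sql
-- 0. Optional diagnostic (not from the answer): list the most frequent join keys.
--    A single very large count, e.g. for order_id = '', would explain one
--    reducer receiving most of the rows.
SELECT order_id, COUNT(*) AS cnt
FROM db.table_1
GROUP BY order_id
ORDER BY cnt DESC
LIMIT 20;

-- 1. Join with the empty keys removed on both sides.
--    order_id <> '' also drops NULLs, because the comparison evaluates to NULL.
SELECT COUNT(*)
FROM (SELECT order_id FROM db.table_1 WHERE order_id <> '') t1
INNER JOIN (SELECT order_id FROM db.table_2 WHERE order_id <> '') t2
  ON t1.order_id = t2.order_id;

-- 2. The same join restricted to one batch.
--    Assumption: batches are hash buckets of order_id; 4 buckets is an
--    arbitrary choice. Run once per bucket (0, 1, 2, 3) and add up the counts.
SELECT COUNT(*)
FROM (SELECT order_id
      FROM db.table_1
      WHERE order_id <> '' AND pmod(hash(order_id), 4) = 0) t1
INNER JOIN (SELECT order_id
            FROM db.table_2
            WHERE order_id <> '' AND pmod(hash(order_id), 4) = 0) t2
  ON t1.order_id = t2.order_id;
```

Because every order_id hashes to exactly one bucket, matching rows always land in the same batch, so the sum of the per-batch counts equals the count of the full join.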


hadoop

hive

join

hiveql

azure-hdinsight