Hive on TEZ query taking forever at Reducer cross product

Question

I have 2 tables:

db1.main_table (32 GB)
db2.lookup_table (2.5 KB)

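The lookup table has only one column, named id, which also exists in main_table and is its primary key. The goal is to take the values in the lookup table and delete every row of main_table that has one of those values. I am using this Hive query (on TEZ), and it has suddenly started creating a cross product in the Reduce stage.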

insert overwrite table 
db1.main_table 
select * from db1.main_table where nvl(id,'NV') not in (select nvl(id,'RV') from db2.lookup_table);

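I am using nvl because the id column of the main table contains NULL values that I do not want to lose.

My query hangs forever at Reducer 2 (which gets only 3 containers).

I get this warning for Reducer 2: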

INFO : Warning: Shuffle Join MERGEJOIN[34][tables = [$hdt$_0, $hdt$_1]] in Stage 'Reducer 2' is a cross product

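I am getting the following plan for this query, which hangs at the Reducer 2 vertex on TEZ.

Can anyone suggest a way to give Reducer 2 more containers, or another way to fix this very long-running job? Any solution would be appreciated.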

Answer 1

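If the lookup table can contain many records with NULL, then at least the 'RV' records are not unique in your query, and it is better to use DISTINCT to reduce the size of the lookup before the join. However, you said that id "is also present and is the primary key of the main_table", and a primary key is unique and NOT NULL. If the PK constraint really is enforced by the process that loads the lookup table, you need neither DISTINCT nor NVL. The same holds for the main table. PK = unique + NOT NULL.
If the main table has many NULLs and all of them are converted to 'NV' before the join, that single value will create skew on the JOIN reducer. If the 'NV' rows should simply pass through, you can exclude them from the join altogether.
This is the most important point. If the lookup table is small enough to fit in memory, use a map join. Read up on mapjoin. The lookup table here is tiny (2.5 KB), so a map join should definitely work.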

set hive.auto.convert.join=true; 
set hive.mapjoin.smalltable.filesize=157286400; --adjust the figure for mapjoin

insert overwrite table db1.main_table 
select m.* 
  from db1.main_table m 
       left join (select DISTINCT nvl(id,'RV') id from db2.lookup_table) l 
              on m.id=l.id --NULLs are not joined
 where l.id is NULL --Not joined records, including NULLs in main table
;
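
Before overwriting 32 GB of data, it may be worth checking that the rewritten query really avoids the cross product. The following is a minimal sketch that reuses the settings and the join from this answer and only asks Hive for the plan. The exact operator names and plan layout depend on the Hive version, so the expectation in the comments is an assumption, not a guarantee.

```sql
-- Same settings as above: allow automatic conversion to a map join.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=157286400;

-- EXPLAIN only prints the plan; it does not run the job
-- and does not modify db1.main_table.
explain
select m.*
  from db1.main_table m
       left join (select distinct nvl(id,'RV') id from db2.lookup_table) l
              on m.id = l.id   -- equality join key, so no cross product is needed
 where l.id is null;

-- Assumption: if the map join is applied, the plan should contain a
-- "Map Join Operator" in a Map vertex instead of a "Merge Join Operator"
-- in Reducer 2, and the cross product warning should not be printed.
```

If the plan still shows a merge join in a reducer, the small-table size threshold or the table statistics are the first things to look at before running the insert overwrite.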

hadoop

hive

query-optimization

hiveql

apache-tez