Hive: LEFT JOIN vs JOIN gives different results with filter in ON clause

Question

Suppose there are two tables:

    table1.c1   table1.c2
1   1           A
2   1           B
3   1           C
4   2           A
5   2           B

and

    table2.c1   table2.c2
1   2           A
2   2           D
3   3           A
4   3           B

When I run this:

select distinct t1.c1, t2.c2
from
schema.table1 t1
join
schema.table2 t2
on (t1.c2 = t2.c2 
    and t1.c1 = t2.c1
    and t1.c1 = 2)

In Hive, I get:

    t1.c1   t2.c2
1   2       A

This is the expected result, no problem. But when I run this:

select distinct t1.c1, t2.c2
from
schema.table1 t1
left join
schema.table2 t2
on (t1.c2 = t2.c2 
    and t1.c1 = t2.c1
    and t1.c1 = 2)

I get:

    t1.c1   t2.c2
1   1       NULL
2   2       NULL
3   2       A

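So it seems that the filter in the ON clause is not working as I expected. The filters t1.c1 = t2.c1 and t1.c1 = 2 are not applied when the LEFT JOIN finds no matching key in the second table, and t2.c2 is therefore NULL.

I guess the answer must be in the docs (perhaps in the 'Joins occur BEFORE WHERE CLAUSES' section?), but I still don't understand the difference.

What is the process that gives the different results?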

Answer 1

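This is how a LEFT (OUTER) JOIN works:

You specify matching conditions in the ON clause. If a matching row is found in the "right" table, it is joined to the row from the "left" table. If there is no matching row, the "left" row is still returned, with all fields from the "right" table set to NULL. So a LEFT JOIN never filters out rows of the "left" table based on the ON conditions. In the terminology of the Hive documentation: the left table is the "preserved row table" and the right table is the "null supplying table".

This contrasts with an INNER JOIN, which returns only rows that have a matching partner in the other table. So there is no "preserved row table" and no need for a "null supplying table".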
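
To make the preserved rows visible, here is a small sketch of the question's LEFT JOIN with DISTINCT removed and each output row labelled. It is an illustration under assumptions, not the asker's exact setup: the sample data is inlined as CTEs named table1 and table2 (instead of schema.table1 and schema.table2), the column aliases and the match_status column are my own additions, and it assumes a Hive version that accepts CTEs and SELECT without FROM.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH table1 AS (
  SELECT 1 AS c1, 'A' AS c2 UNION ALL
  SELECT 1 AS c1, 'B' AS c2 UNION ALL
  SELECT 1 AS c1, 'C' AS c2 UNION ALL
  SELECT 2 AS c1, 'A' AS c2 UNION ALL
  SELECT 2 AS c1, 'B' AS c2
),
table2 AS (
  SELECT 2 AS c1, 'A' AS c2 UNION ALL
  SELECT 2 AS c1, 'D' AS c2 UNION ALL
  SELECT 3 AS c1, 'A' AS c2 UNION ALL
  SELECT 3 AS c1, 'B' AS c2
)
SELECT
  t1.c1 AS left_c1,
  t1.c2 AS left_c2,
  t2.c2 AS right_c2,
  -- t2.c1 is never NULL in the sample data, so NULL here means "no match".
  CASE WHEN t2.c1 IS NULL THEN 'preserved, no match' ELSE 'matched' END AS match_status
FROM table1 t1
LEFT JOIN table2 t2
  ON (t1.c2 = t2.c2
      AND t1.c1 = t2.c1
      AND t1.c1 = 2)   -- only restricts which pairs match, never drops t1 rows
ORDER BY left_c1, left_c2;

-- left_c1  left_c2  right_c2  match_status
-- 1        A        NULL      preserved, no match
-- 1        B        NULL      preserved, no match
-- 1        C        NULL      preserved, no match
-- 2        A        A         matched
-- 2        B        NULL      preserved, no match
```

Applying DISTINCT to (left_c1, right_c2) collapses these five rows into the three rows shown in the question: 1/NULL, 2/NULL and 2/A.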

Answer 2

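The output of a LEFT JOIN is supposed to differ from that of an inner JOIN.

The output of a LEFT JOIN contains every row of the left table (the one written first of the two); where the right table has no corresponding row, NULL is shown instead. If you remove the distinct from your query, add t1.c2 to the select list, and run it, the output should clear up your confusion about how LEFT/RIGHT joins work.

Inner join output

t1.c1   t1.c2   t2.c2
2       A       A

Left join output

t1.c1   t1.c2   t2.c2
1       A       NULL
1       B       NULL
1       C       NULL
2       A       A
2       B       NULL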

Answer 3

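Hive, like other SQL engines, treats ON-clause conditions differently in inner joins and left joins. In an inner join you can put filter conditions in the ON clause, but in a left join the ON clause only decides which right-table rows match, so a filter on the primary table (t1 in this case) has to go into a separate WHERE clause. If you try

select distinct t1.c1, t2.c2
from
schema.table1 t1
left join
schema.table2 t2
on (t1.c2 = t2.c2
    and t1.c1 = t2.c1)
where t1.c1 = 2;

you should get the result you expect: only rows with t1.c1 = 2 are returned.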
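
An alternative to the WHERE clause is to filter the left table in a subquery before joining. The sketch below is a hedged illustration rather than the asker's exact setup: the sample data is inlined as CTEs named table1 and table2 (instead of schema.table1 and schema.table2), the column aliases are my own, and it assumes a Hive version that accepts CTEs and SELECT without FROM.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH table1 AS (
  SELECT 1 AS c1, 'A' AS c2 UNION ALL
  SELECT 1 AS c1, 'B' AS c2 UNION ALL
  SELECT 1 AS c1, 'C' AS c2 UNION ALL
  SELECT 2 AS c1, 'A' AS c2 UNION ALL
  SELECT 2 AS c1, 'B' AS c2
),
table2 AS (
  SELECT 2 AS c1, 'A' AS c2 UNION ALL
  SELECT 2 AS c1, 'D' AS c2 UNION ALL
  SELECT 3 AS c1, 'A' AS c2 UNION ALL
  SELECT 3 AS c1, 'B' AS c2
)
SELECT DISTINCT
  t1.c1 AS left_c1,
  t2.c2 AS right_c2
FROM (
  -- Filter the preserved (left) table first, so only c1 = 2 rows reach the join.
  SELECT c1, c2 FROM table1 WHERE c1 = 2
) t1
LEFT JOIN table2 t2
  ON (t1.c2 = t2.c2
      AND t1.c1 = t2.c1);

-- Result (row order not guaranteed):
-- left_c1  right_c2
-- 2        A
-- 2        NULL
```

The 2/NULL row remains because the left row (2, B) has no partner in table2 and a LEFT JOIN still preserves it. If only matched rows are wanted, which here is the single 2/A row, an inner JOIN is the simpler choice.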

