Hive: Find unique values if condition met between two tables

Question

I have two tables. Table 1 has all the unique places I'm interested in (30 rows):

places
japan
china
india
...

Table 2 has all the information about ID, places visited, and date.

id	places	date
10001	japan	20210204
10001	australia	20210204
10001	china	20210204
10001	argentina	20210205
10002	spain	20210204
10002	india	20210204
10002	china	20210205
10003	argentina	20210204
10003	portugal	20210204

What I'm interested in:

For a specific date (say 20210204)
Find all unique IDs from Table 2 that visited at least one of the places in Table 1
Save those unique IDs to a temporary table.

Here is what I tried:

create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in 
(select * from table1)) final;

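I'm struggling to incorporate the "at least one" logic, so that once a qualifying id is found, the query stops looking at the remaining records for that id.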

Answer 1

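Use a left semi join (an efficient implementation of an uncorrelated EXISTS). It keeps only the rows from table2 that have a match in table1; then apply distinct: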

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

This satisfies the "at least one" condition: IDs with no joined records will not appear in the result.

Another way is to use a correlated EXISTS:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204' 
   --this condition is true as soon as one match is found
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;

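IN also works.

The correlated EXISTS looks close to "once a qualifying id is found, stop looking at that id's records", but in Hive all of these methods are implemented using a JOIN. Run EXPLAIN and you will see that they generate the same plan, although this depends on the implementation in your Hive version. Potentially EXISTS can be faster, because it does not need to check all records in the subquery. Given that your table1, with 30 rows, is small enough to fit in memory, a MAP-JOIN (set hive.auto.convert.join=true;) will give you the best performance.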
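
To check this on your own cluster, a minimal sketch along the following lines enables automatic map-join conversion and prints the plan for the left semi join variant. The table and column names are the ones from the question. The exact plan text varies between Hive versions and execution engines, so treat the comment about the operator name as something to look for rather than a guaranteed output:

```sql
-- Let Hive convert joins against small tables into map joins.
set hive.auto.convert.join=true;

-- Print the execution plan instead of running the query.
-- With the conversion enabled, the small table1 is typically handled by a
-- map join operator rather than a shuffle (reduce-side) join.
explain
select distinct t2.id
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204';
```

Running the same EXPLAIN on the EXISTS and IN variants lets you compare the three plans side by side.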

One more fast method uses an array or IN (static_list). It is suitable for small, static lists. An ordered array may give you better performance:

select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204'
       and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
       --OR use t2.places IN ('australia', 'china', 'japan', ... )

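Why this method is faster: there is no need to start mappers and calculate splits to read the lookup table (table1) from HDFS, so only table2 is read. The drawback is that the list of values is static. On the other hand, you can pass the whole list as a parameter.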
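
One way to pass the list as a parameter is Hive variable substitution. The sketch below is only an illustration: the variable name places_list is hypothetical, and only three of the 30 places are shown. It assumes variable substitution is enabled, which is the default (hive.variable.substitute=true):

```sql
-- Hypothetical variable holding the quoted, comma-separated list of places.
-- Only three sample values from the question are shown here.
set hivevar:places_list='australia','china','japan';

-- ${hivevar:places_list} is replaced textually before the query is compiled,
-- so the IN clause ends up as a static list and table1 is never read.
select distinct t2.id
  from table2 t2
 where t2.date = '20210204'
   and t2.places in (${hivevar:places_list});
```

The same variable can also be supplied from the command line with the --hivevar option when the query is stored in a script file, which keeps the list out of the script itself.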
