Efficient way to join big tables by substrings using Hive/Impala or other means

Question

I have 2 tables. tabl1:

+-------+--------+--------+----------+
| att1  |  att2  | att3   | att4     |
+-------+--------+--------+----------+
|  abcd | ava012 | df012f | afsdaldf |
.......

and tabl2:

+----+
| val|
+----+
| 012|
...

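tabl2 contains numbers that can appear as substrings in one or more of the 4 columns of tabl1. Both tables are big, with millions of records each. I tried concatenating the tabl1 columns and searching inside the result, but the query never finishes. Is there an efficient way to do this? Maybe convert the whole table to a txt file and search in it? Here are some of my attempts (all in Hive):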

SELECT a.*, b.*
FROM tabl1 a, tabl2 b
WHERE instr(
        concat(cast(a.att1 as string), cast(a.att2 as string),
               cast(a.att3 as string), cast(a.att4 as string)),
        cast(b.val as string)) > 0

or

SELECT a.*, b.*
FROM tabl1 a, tabl2 b
WHERE concat(cast(a.att1 as string), cast(a.att2 as string),
             cast(a.att3 as string), cast(a.att4 as string))
      LIKE concat('%', cast(b.val as string), '%')

I also tried some regex approaches, but the run time was endless...

Answer 1

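-- Reduce each tabl1 row to its distinct digit runs, explode them into one row
-- per run, then equi-join on tabl2.val instead of comparing every pair of rows.
-- Backslashes are doubled because Hive unescapes string literals
-- (a single '\D' would match the letter D).
select  *

from           (select  *
                from    tabl1 t1
                        lateral view explode(split(regexp_replace(trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\\D+',' ')),'(?<=^| )(?<token>.*?) (?=.*(?<= )\\k<token>(?= |$))',''),' ')) e as val
                ) t1

        join    tabl2 t2

        on      t2.val = t1.val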
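
To make the steps of the query easier to follow, here is a rough Python sketch of the same idea, not the Hive execution itself. The names digit_tokens and join_on_tokens and the second sample row are made up for illustration, and the de-duplication uses dict.fromkeys rather than the lookaround regex, since Python's re does not accept that variable-width lookbehind.

```python
import re

def digit_tokens(*columns):
    """Return the distinct maximal digit runs in the given column values."""
    joined = ",".join(str(c) for c in columns)   # mirrors concat_ws(',', att1..att4)
    runs = re.findall(r"[0-9]+", joined)         # same tokens as replacing \D+ by ' ' and splitting
    return list(dict.fromkeys(runs))             # drop repeats, keep first-seen order

def join_on_tokens(tabl1_rows, tabl2_vals):
    """Equi-join: one set lookup per token instead of comparing every row pair."""
    wanted = set(tabl2_vals)                     # plays the role of the hashed join side
    matches = []
    for row in tabl1_rows:
        for token in digit_tokens(*row):         # mirrors lateral view explode(...)
            if token in wanted:
                matches.append((row, token))
    return matches

# First row is the sample from the question; the second is hypothetical.
tabl1 = [("abcd", "ava012", "df012f", "afsdaldf"), ("x7", "y", "z99", "w")]
tabl2 = ["012", "99", "5"]
result = join_on_tokens(tabl1, tabl2)
```

Note that this approach, like the query, matches only when val equals a whole run of digits in tabl1. A val such as 01 would not match ava012, so it fits only if the numbers in tabl1 are always delimited by non-digit characters.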

