Efficient way to join big tables by substrings using Hive/Impala or other means
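I have 2 tables. tabl1:
+------+--------+--------+----------+
| att1 | att2   | att3   | att4     |
+------+--------+--------+----------+
| abcd | ava012 | df012f | afsdaldf |
...
and tabl2:
+-----+
| val |
+-----+
| 012 |
...
tabl2 contains numbers that can appear as a substring in one or more of the 4 columns of tabl1.
Both tables are large, with millions of records each.
I tried concatenating the tabl1 columns and searching in the result, but the query never finishes.
Is there an efficient way to do this? Maybe convert the whole table to a txt file and search in that?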
Here are some of the things I have tried (all in Hive):
-- Attempt 1: cross join, then substring search with instr()
SELECT a.*, b.*
FROM tabl1 a, tabl2 b
WHERE instr(
        concat(cast(a.att1 AS string), cast(a.att2 AS string),
               cast(a.att3 AS string), cast(a.att4 AS string)),
        cast(b.val AS string)
      ) > 0
or
-- Attempt 2: cross join, then substring search with LIKE
SELECT a.*, b.*
FROM tabl1 a, tabl2 b
WHERE concat(cast(a.att1 AS string), cast(a.att2 AS string),
             cast(a.att3 AS string), cast(a.att4 AS string))
      LIKE concat('%', cast(b.val AS string), '%')
I also tried a few regex variants, but they all run seemingly forever...
select *
from (select *
      from tabl1 t1
      lateral view explode(split(regexp_replace(trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\D+',' ')),'(?<=^| )(?<token>.*?) (?=.*(?<= )\k<token>(?= |$))',''),' ')) e as val
     ) t1
join tabl2 t2
on t2.val = t1.val
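
As far as I can tell, the query above avoids the cross join by pulling the digit runs out of the four columns, dropping repeated runs within a row, and then doing a plain equality join against tabl2. Since the question allows "other ways", here is a minimal Python sketch of that same idea. The function names `digit_tokens` and `join_on_digit_tokens` and the sample rows are made up for illustration. One assumption to watch: this only matches when val equals a whole run of digits (e.g. `012` in `ava012`); it would not find `12` inside `012`.

```python
import re
from collections import defaultdict


def digit_tokens(*columns):
    """Return the distinct maximal digit runs found in the column values."""
    joined = ",".join(str(c) for c in columns if c is not None)
    # A set drops runs that repeat within the same row, so each row
    # matches a given val at most once.
    return set(re.findall(r"[0-9]+", joined))


def join_on_digit_tokens(tabl1_rows, tabl2_vals):
    """Equality-join tabl2 values against the digit runs of tabl1 rows."""
    # Build a hash index once: digit run -> rows that contain it.
    index = defaultdict(list)
    for row in tabl1_rows:
        for token in digit_tokens(*row):
            index[token].append(row)
    # Probe the index per val instead of comparing every pair of rows.
    matches = []
    for val in tabl2_vals:
        for row in index.get(val, []):
            matches.append((row, val))
    return matches


# Hypothetical sample data shaped like the tables in the question.
tabl1 = [
    ("abcd", "ava012", "df012f", "afsdaldf"),
    ("x7", "y", "z99", "w"),
]
tabl2 = ["012", "99", "5"]

result = join_on_digit_tokens(tabl1, tabl2)
for row, val in result:
    print(val, row)
```

The point of the design is that each row is tokenized once and then looked up by key, which is what lets Hive run it as a regular equi-join rather than testing every tabl1 row against every tabl2 value.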