Reduction in the number of records using range join
I have the following tables. The first (ranges) includes value ranges and additional columns:
row | From | To | Country ....
-----|--------|---------|---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...
From and To are bigint and mutually exclusive. The ranges table has 1.8M records. The other table (values) contains 2.7M records and looks like:
row | Value | More columns....
--------|--------|----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...
I want to create a table as follows:
row | Value | From | To | More columns....
--------|--------|--------|-------|---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...
I used a left outer join in the code below:
set n=1000;
select v.id
      ,v.val
      ,r.from_val
      ,r.to_val
from val v
left outer join
    (select r.*
           ,floor(from_val/${hiveconf:n}) + pe.i as match_val
     from val_range r
     lateral view posexplode
         (
             split
             (
                 space
                 (
                     cast
                     (
                         floor(to_val/${hiveconf:n})
                       - floor(from_val/${hiveconf:n})
                         as int
                     )
                 )
                ,' '
             )
         ) pe as i,x
    ) r
on floor(v.val/${hiveconf:n}) = r.match_val
where v.val between r.from_val and r.to_val
order by v.id
;
However, the new table has far fewer records: about 31k out of 2.7M. How can that be when I used a left outer join? How can I fix it?
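The where clause filters on r.from_val and r.to_val, which are NULL for every value without a matching range, so those rows are dropped and the left outer join behaves like an inner join. Assuming we have a v.id, do the range match in an inner join inside a subquery and left join the result back to val: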
set n=1000;
select v.id
      ,r.from_val
      ,r.to_val
from val v
left join
    (select v.id
           ,r.from_val
           ,r.to_val
     from val v
     join (...) r
     on floor(v.val/${hiveconf:n}) = r.match_val
     where v.val between r.from_val and r.to_val
    ) r
on r.id = v.id
order by v.id
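To see the effect on a small scale, here is a minimal sketch that runs both query shapes against an in-memory SQLite database through Python. SQLite stands in for Hive here, which is my assumption, as are the two extra sample rows (1600 and 9999) and the simplified bucket join without posexplode (each sample range stays inside a single bucket of 1000).

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption: the left join
# and NULL semantics shown here are standard SQL and behave the same way).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table val (id integer, val integer);
    create table val_range (from_val integer, to_val integer);
    -- rows 1-4 come from the question; rows 5 and 6 are hypothetical
    -- values that fall outside every range
    insert into val values (1, 1777), (2, 2122), (3, 1832), (4, 1340),
                           (5, 1600), (6, 9999);
    insert into val_range values (1200, 1500), (2200, 2700),
                                 (1700, 1900), (2100, 2150);
""")

# Shape of the question's query: the range predicate sits in WHERE.
# Unmatched rows carry NULL in r.from_val / r.to_val, BETWEEN evaluates
# to NULL for them, and WHERE throws them away.
filtered = conn.execute("""
    select v.id, v.val, r.from_val, r.to_val
    from val v
    left join val_range r
    on v.val / 1000 = r.from_val / 1000   -- integer division = bucket id
    where v.val between r.from_val and r.to_val
    order by v.id
""").fetchall()

# Shape of the answer's query: match inside a subquery with an inner join,
# then left join the matches back to val on id so every value survives.
kept = conn.execute("""
    select v.id, v.val, m.from_val, m.to_val
    from val v
    left join (select v.id, r.from_val, r.to_val
               from val v
               join val_range r
               on v.val / 1000 = r.from_val / 1000
               where v.val between r.from_val and r.to_val
              ) m
    on m.id = v.id
    order by v.id
""").fetchall()

print(len(filtered), "rows with the predicate in WHERE")
print(len(kept), "rows with the match moved into a subquery")
for row in kept:
    print(row)
```

The first query loses both unmatched values, while the second keeps them with NULL (None in Python) in the range columns.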
As per the OP's request, here is the full query:
set n=1000;
select v.id
      ,r.from_val
      ,r.to_val
from val v
left join
    (select v.id
           ,r.from_val
           ,r.to_val
     from val v
     join
         (select r.*
                ,floor(from_val/${hiveconf:n}) + pe.i as match_val
          from val_range r
          lateral view posexplode
              (
                  split
                  (
                      space
                      (
                          cast
                          (
                              floor(to_val/${hiveconf:n})
                            - floor(from_val/${hiveconf:n})
                              as int
                          )
                      )
                     ,' '
                  )
              ) pe as i,x
         ) r
     on floor(v.val/${hiveconf:n}) = r.match_val
     where v.val between r.from_val and r.to_val
    ) r
on r.id = v.id
order by v.id
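For readers wondering what the posexplode(split(space(...),' ')) part does: as far as I can tell, it emits one row per bucket of width n that a range touches, so the range condition can be pre-filtered with an equality join on the bucket id. Below is a rough Python sketch of the same idea. The function names bucket_ids and range_join are hypothetical, and the row with value 9999 is an extra sample I added.

```python
from collections import defaultdict


def bucket_ids(from_val, to_val, n):
    """Return the id of every width-n bucket that [from_val, to_val] touches.

    Mirrors floor(from_val/n) + pe.i for pe.i = 0 .. floor(to_val/n) - floor(from_val/n).
    """
    first = from_val // n
    last = to_val // n
    return [first + i for i in range(last - first + 1)]


def range_join(values, ranges, n=1000):
    """Left-join (id, val) pairs to (from_val, to_val) ranges via bucket ids."""
    # Index each range under every bucket it touches (the exploded rows).
    by_bucket = defaultdict(list)
    for from_val, to_val in ranges:
        for bucket in bucket_ids(from_val, to_val, n):
            by_bucket[bucket].append((from_val, to_val))

    result = []
    for row_id, val in values:
        match = (None, None)  # kept when no range contains val (left join)
        # Equality on the bucket id narrows the candidates; the BETWEEN
        # check then picks the actual range.
        for from_val, to_val in by_bucket.get(val // n, []):
            if from_val <= val <= to_val:
                match = (from_val, to_val)
                break  # ranges do not overlap, so one match is enough
        result.append((row_id, val, match[0], match[1]))
    return result


ranges = [(1200, 1500), (2200, 2700), (1700, 1900), (2100, 2150)]
values = [(1, 1777), (2, 2122), (3, 1832), (4, 1340), (5, 9999)]
joined = range_join(values, ranges)
for row in joined:
    print(row)
```

A smaller n means fewer candidate ranges per bucket but more exploded rows for wide ranges, so the value of 1000 is a tuning choice rather than anything fixed.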