How to do an SQL join on the closest approximation
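I have a data table in which the relevant column contains a number. I want to join it to a reference table that contains an ordered list of numbers, matching each row of the data table to the closest number in the reference table (closest meaning smallest difference).
I could do something like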
Select top 1 ref_number
from reference
where ref_number < data_number
order by ref_number desc
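and do the same with ref_number > data_number, then pick the smaller of the two differences and join on it. This would work, but it is a lot of code for a seemingly simple operation, and because it has to go through the whole reference table twice for every row of the data table, I expect it to be very slow (the reference table has about 25,000 entries and the data table about 1 million).
So the question is: is there a more efficient way to match the closest number in the reference table to each entry in the data table? If both tables were sorted by their numbers, the matching should be much easier, but I cannot see the SQL code to do this.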
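Here is a simplified example of what I want to get.
Reference table
ref_number | other reference columns
---------: | -----------------------
         1 | ..
         3 |
         6 |
        10 |
Data table
data_number
----------:
          1
          7
        9.2
Target table
data_number | ref_number | other reference columns
----------: | ---------: | -----------------------
          1 |          1 | ..
          7 |          6 |
        9.2 |         10 |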
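You can generate two intervals for each reference value: one below the current value and one above it. Two joins on a between-style range predicate will then assign two reference values to each data_number: the closest one above the current data_number and the closest one at or below it. Then decide which of the two is nearer.
The code is below (based on a Postgres fiddle, but the syntax is the same for most modern DBMSs).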
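-- Table definitions are assumed; the original setup is not shown.
create table base_tab (data_number numeric);
create table ref_tab (ref_number numeric);
insert into base_tab (data_number)
values (0), (1), (4.5), (7), (9.2);
insert into ref_tab (ref_number)
values (1), (3), (6), (9);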
with fromto as (
  select
    ref_number as num
    /*
      Identify the first and the last row
      in the data set to manage values below
      the lowest (in the reference data set)
      and above the highest
    */
    , lag(0, 1, 1)
        over(order by ref_number asc)
      as first_row
    , lead(0, 1, 1)
        over(order by ref_number asc)
      as last_row
    /*
      And add value ranges (before and after current ref_num)
      to match data_number as closest above
      or closest below
    */
    , lead(ref_number, 1, 999999999999999.0)
        over(order by ref_number asc)
      as next_num
    , lag(ref_number, 1, -999999999999999.0)
        over(order by ref_number asc)
      as prev_num
  from ref_tab
)
select
  b.*
  , case
      /*
        When the data_number is below the lowest
        in the reference data set, then use the lowest
        value from reference data set
      */
      when up_.first_row = 1
      then up_.num
      /*
        When the data_number is above the highest
        in the reference data set, then use the highest
        value from reference data set
      */
      when down_.last_row = 1
      then down_.num
      /*
        Assign the closest value from the reference
        data set with minimal difference.
        Lower value has higher priority
      */
      when b.data_number - down_.num
        <= up_.num - b.data_number
      then down_.num
      when up_.num - b.data_number
        > b.data_number - down_.num
      then up_.num
    end as ref_num
from base_tab as b
  /*Test if data_number is below current ref_num*/
  left join fromto as up_
    on
      b.data_number >= up_.prev_num
      and b.data_number < up_.num
  /*Test if data_number is above current ref_num*/
  left join fromto as down_
    on
      b.data_number >= down_.num
      and b.data_number < down_.next_num;
data_number | ref_num
----------: | ------:
0 | 1
1 | 1
4.5 | 3
7 | 6
9.2 | 9
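UPD: Note that a range join requires a loop in any case: with no equality condition, the DBMS has to test every row (of the whole table, or from some point in a sorted data set). You could try to gain some improvement with an explicit loop over sorted data (a sort-merge join) in a procedural extension (SQLScript in the case of HANA), but I think this is the optimizer's job and should happen implicitly.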
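For illustration only, the explicit loop over sorted data mentioned in the update could look roughly like the sketch below. It is written in Python rather than SQLScript, the function name `closest_matches` is hypothetical, and it assumes both columns fit in memory and the reference list is not empty. Ties go to the lower reference value, matching the priority used in the query above.

```python
def closest_matches(data_numbers, ref_numbers):
    """Pair each data number with the nearest reference number.

    Both inputs are sorted once and then walked in a single pass,
    which is the same idea as a sort-merge join.
    """
    # Duplicates in the reference list carry no extra information here.
    refs = sorted(set(ref_numbers))
    if not refs:
        raise ValueError("ref_numbers must not be empty")

    result = []
    i = 0  # index of the current candidate in refs
    for d in sorted(data_numbers):
        # Advance while the next reference value is strictly closer to d.
        # Using '<' (not '<=') keeps the lower value on ties.
        while i + 1 < len(refs) and abs(refs[i + 1] - d) < abs(refs[i] - d):
            i += 1
        result.append((d, refs[i]))
    return result


print(closest_matches([0, 1, 4.5, 7, 9.2], [1, 3, 6, 9]))
```

Because the pointer into the reference list never moves backwards, the work after sorting is linear in the combined size of the two lists. The result comes back ordered by data number, not in the original row order.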
You can use the RANK window function to implement this kind of join. Here is a snippet with a matching schema:
CREATE TABLE tab_ref (id integer, value double);
INSERT INTO tab_ref VALUES (1, 1.0);
INSERT INTO tab_ref VALUES (2, 3.0);
INSERT INTO tab_ref VALUES (3, 6.0);
INSERT INTO tab_ref VALUES (4, 10.0);
CREATE TABLE tab_data (id integer, value double);
INSERT INTO tab_data VALUES (1, 1.0);
INSERT INTO tab_data VALUES (2, 7.0);
INSERT INTO tab_data VALUES (3, 9.2);
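SELECT data_id, data_value, ref_id, ref_value
FROM
(
    -- Explicit aliases avoid duplicate column names (id, value) in the subquery
    SELECT b.id AS data_id, b.value AS data_value,
           a.id AS ref_id, a.value AS ref_value,
           -- a.value as a second sort key breaks ties toward the lower reference value
           RANK() OVER (PARTITION BY b.id ORDER BY ABS(a.value - b.value), a.value) AS rnk
    FROM tab_ref a CROSS JOIN tab_data b
) ranked
WHERE rnk = 1;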
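As a quick check of the RANK approach against the sample data from the question, the sketch below runs the same idea in an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is only a stand-in chosen to make the example runnable, not the DBMS from the question; it needs an SQLite build with window function support (3.25 or later), and the column aliases are illustrative. Note that the cross join compares every data row with every reference row, so this form is practical mainly for small tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab_ref  (id INTEGER, value REAL);
    CREATE TABLE tab_data (id INTEGER, value REAL);
    INSERT INTO tab_ref  VALUES (1, 1.0), (2, 3.0), (3, 6.0), (4, 10.0);
    INSERT INTO tab_data VALUES (1, 1.0), (2, 7.0), (3, 9.2);
""")

# Rank every reference row by its distance to each data row and
# keep the nearest one; a.value breaks ties toward the lower value.
rows = conn.execute("""
    SELECT data_value, ref_value
    FROM (
        SELECT b.id    AS data_id,
               b.value AS data_value,
               a.value AS ref_value,
               RANK() OVER (
                   PARTITION BY b.id
                   ORDER BY ABS(a.value - b.value), a.value
               ) AS rnk
        FROM tab_ref a CROSS JOIN tab_data b
    ) AS ranked
    WHERE rnk = 1
    ORDER BY data_id
""").fetchall()

for data_value, ref_value in rows:
    print(data_value, ref_value)
```

With the question's sample data this pairs 1 with 1, 7 with 6, and 9.2 with 10, which is the target table the question asks for.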