How to do an SQL join on the closest approximation

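I have a data table whose relevant column contains a number. I want to join it to a reference table that contains an ordered list of numbers, matching each row of the data table to the closest number in the reference table (closest meaning the smallest difference).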

I could do something like

Select top 1 ref_number 
from reference 
where ref_number < data_number
order by ref_number desc

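and do the same with ref_number > data_number, then pick the smaller difference and join. That would work, but it is a lot of code for a seemingly simple operation. It also has to traverse the whole reference table twice for every row of the data table, so I expect it to be very slow (the reference table has about 25,000 entries and the data table about 1 million).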

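So the question is: is there a more efficient way to match each entry in the data table to the closest number in the reference table? If both tables were sorted by their numbers, the matching should be much easier, but I can't see how to write SQL that takes advantage of that.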

Here is a simplified example of what I want to get.

Reference table

ref_number  other reference columns 
1           ..
3
6
10

Data table

data_number 
1           
7
9.2

Target table

data_number ref_number other reference columns
1           1          ..
7           6
9.2         10

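You can generate two intervals for each reference value: one below the current value and one above it. Two joins on range (between-style) predicates then assign two reference values to each data_number: the closest one above it and the closest one below it. After that, decide which of the two is closer.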

The code is below (based on a Postgres fiddle, but the syntax is the same for most modern DBMSs).

insert into base_tab (data_number)
values (0), (1), (4.5), (7), (9.2);
insert into ref_tab (ref_number)
values (1), (3), (6), (9);
with fromto as (
  select
    ref_number as num
    /*
      Identify the first and the last row
      in the data set to manage values below
      the lowest (in the reference data set)
      and above the highest
    */
    , lag(0, 1, 1)
      over(order by ref_number asc)
      as first_row
    , lead(0, 1, 1)
      over(order by ref_number asc)
      as last_row
    /*
      And add value ranges (before and after current ref_num)
      to match data_number as closest above
      or closest below
    */
    , lead(ref_number, 1, 999999999999999.0)
      over(order by ref_number asc)
      as next_num
    , lag(ref_number, 1, -999999999999999.0)
      over(order by ref_number asc)
      as prev_num
  from ref_tab
)

select
  b.*
  , case
      /*
        When the data_number is below the lowest
        in the reference data set, then use the lowest
        value from reference data set
      */
      when up_.first_row = 1
      then up_.num
      /*
        When the data_number is above the highest
        in the reference data set, then use the highest
        value from reference data set
      */
      when down_.last_row = 1
      then down_.num
      /*
        Assign the closest value from the reference
        data set with minimal difference.
        Lower value has higher priority
      */
      when b.data_number - down_.num
        <= up_.num - b.data_number
      then down_.num
      when up_.num - b.data_number
        > b.data_number - down_.num
      then up_.num
    end as ref_num
from base_tab as b
  /*Test if data_number is below current ref_num*/
  left join fromto as up_
    on
      b.data_number >= up_.prev_num
      and b.data_number < up_.num
  /*Test if data_number is above current ref_num*/
  left join fromto as down_
    on
      b.data_number >= down_.num
      and b.data_number < down_.next_num;

data_number | ref_num
----------: | ------:
          0 |       1
          1 |       1
        4.5 |       3
          7 |       6
        9.2 |       9

db<>fiddle here
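
The decision rule in the query above (take the nearest neighbour below and above, prefer the lower one on a tie, and clamp at both ends) can be checked outside the database. The following is only a minimal Python sketch of that rule, not a replacement for the SQL; the function name `closest_ref` is hypothetical, and the reference list is assumed to be sorted and non-empty.

```python
from bisect import bisect_right


def closest_ref(data_number, refs):
    """Return the value in the sorted list `refs` closest to data_number.

    Ties go to the lower reference value, as in the SQL case expression.
    """
    # After this call: every refs[:i] <= data_number < every refs[i:]
    i = bisect_right(refs, data_number)
    if i == 0:
        # Below the lowest reference value: use the lowest
        return refs[0]
    if i == len(refs):
        # At or above the highest reference value: use the highest
        return refs[-1]
    down, up = refs[i - 1], refs[i]
    return down if data_number - down <= up - data_number else up


# Same sample data as the insert statements in the answer
refs = [1, 3, 6, 9]
data = [0, 1, 4.5, 7, 9.2]
result = [(d, closest_ref(d, refs)) for d in data]
print(result)  # [(0, 1), (1, 1), (4.5, 3), (7, 6), (9.2, 9)]
```

The printed pairs agree with the result table above, including the tie at 4.5, which resolves to the lower value 3.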

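UPD: Note that a range join needs a loop in any case. Because there is no equality condition, the DBMS has to test each row (either the whole table or, for a sorted data set, the rows from some starting point onward). You could try to gain something by looping explicitly over sorted data (a sort-merge join) in a procedural extension (SQLScript in the case of HANA), but I think this is the optimizer's job and should happen implicitly.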
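
To make the "explicit loop over sorted data" idea concrete, here is a rough sketch in Python rather than in a procedural SQL extension. It is an illustration under assumptions, not something taken from the fiddle: the function name `merge_closest` is hypothetical, the reference list must be non-empty, and duplicate reference values are dropped up front so the pointer never stalls on equal distances.

```python
def merge_closest(data, refs):
    """Match every data value to its closest reference value in one merge pass.

    Ties go to the lower reference value.
    """
    refs_sorted = sorted(set(refs))  # distinct, ascending
    matches = []
    j = 0  # index of the current best reference value; it never moves backwards
    for d in sorted(data):
        # Advance while the next reference value is strictly closer to d
        while (j + 1 < len(refs_sorted)
               and abs(refs_sorted[j + 1] - d) < abs(refs_sorted[j] - d)):
            j += 1
        matches.append((d, refs_sorted[j]))
    return matches


# Sample data from the question
matches = merge_closest([1, 7, 9.2], [1, 3, 6, 10])
print(matches)  # [(1, 1), (7, 6), (9.2, 10)]
```

Once both inputs are sorted, the loop touches each data value once and moves the reference pointer forward at most once per reference value, which is the saving the question hopes for from sorted tables.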

You can use the RANK window function to implement this kind of join. Here is a snippet with the relevant schema:

CREATE TABLE tab_ref (id integer, value double);
INSERT INTO tab_ref VALUES (1, 1.0);
INSERT INTO tab_ref VALUES (2, 3.0);
INSERT INTO tab_ref VALUES (3, 6.0);
INSERT INTO tab_ref VALUES (4, 10.0);

CREATE TABLE tab_data (id integer, value double);
INSERT INTO tab_data VALUES (1, 1.0);
INSERT INTO tab_data VALUES (2, 7.0);
INSERT INTO tab_data VALUES (3, 9.2);
    
    
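SELECT data_id, data_value, ref_id, ref_value
FROM
(
    -- Cross join: every data row is ranked against every reference row
    SELECT b.id AS data_id, b.value AS data_value,
           a.id AS ref_id, a.value AS ref_value,
           RANK() OVER (PARTITION BY b.id ORDER BY ABS(a.value - b.value)) AS rnk
    FROM tab_ref a CROSS JOIN tab_data b
) ranked
WHERE rnk = 1;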
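
As a quick sanity check, the RANK approach can be run against the question's sample data. The sketch below uses Python's built-in `sqlite3` module purely as a convenient test harness; that is an assumption of this example (the answer does not name a DBMS), and it needs an SQLite build with window function support (3.25 or later). Column aliases such as `data_value` and `ref_value` are introduced here for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab_ref (id INTEGER, value REAL);
    INSERT INTO tab_ref VALUES (1, 1.0), (2, 3.0), (3, 6.0), (4, 10.0);
    CREATE TABLE tab_data (id INTEGER, value REAL);
    INSERT INTO tab_data VALUES (1, 1.0), (2, 7.0), (3, 9.2);
""")

# Rank every reference value per data row by absolute difference,
# then keep only the best-ranked pairs.
rows = conn.execute("""
    SELECT data_value, ref_value
    FROM (
        SELECT b.value AS data_value,
               a.value AS ref_value,
               RANK() OVER (PARTITION BY b.id
                            ORDER BY ABS(a.value - b.value)) AS rnk
        FROM tab_ref AS a CROSS JOIN tab_data AS b
    ) AS ranked
    WHERE rnk = 1
    ORDER BY data_value
""").fetchall()
print(rows)  # [(1.0, 1.0), (7.0, 6.0), (9.2, 10.0)]
conn.close()
```

Two things are worth keeping in mind. RANK returns both reference rows when a data value is exactly halfway between two of them (ROW_NUMBER with a tie-breaking sort key would return one). And the cross join compares every data row with every reference row, so at the sizes mentioned in the question (about 1 million by 25,000) the intermediate result is on the order of 25 billion pairs.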