Select distinct join on multiple columns on Presto

Question

I have two tables.

Table 1 looks like this:

+-----+-----+----------+--------+
| id1 | id2 | date     | degree |
+-----+-----+----------+--------+
|   1 |  10 | 20200101 |      1 |
|   1 |  11 | 20200101 |      1 |
|   1 |  11 | 20200101 |      1 |
|   2 |  52 | 20200101 |      2 |
|   2 |  52 | 20200101 |      2 |
|   2 |  53 | 20200101 |      2 |
|   3 |  21 | 20200101 |      2 |
| ... | ... | ...      |    ... |
+-----+-----+----------+--------+

And table 2 is:

+-----+-----+----------+-------+------+
| id1 | id2 | date     | price | rank |
+-----+-----+----------+-------+------+
|   1 |  10 | 20200101 |  1200 |    1 |
|   1 |  10 | 20200101 |  1200 |    2 |
|   1 |  10 | 20200101 |       |      |
|   1 |  10 | 20200101 |  1300 |    1 |
|   1 |  10 | 20200101 |  1300 |    2 |
| ... | ... | ...      |   ... |  ... |
+-----+-----+----------+-------+------+

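What I want to do is take the price column from table 2 and add it to table 1, matching on the three columns id1, id2 and date. If I do a simple join like this: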

select tab1.id1, tab1.id2, tab1.date, tab2.price
from tab1
left join tab2
on tab1.id1 = tab2.id1
and tab1.id2 = tab2.id2
and tab1.date = tab2.date

this is what I get:

+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |       |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
+-----+-----+----------+-------+--------+

But what I actually want is this:

+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
+-----+-----+----------+-------+--------+

Answer 1

Use GROUP BY:

select * from (
 select tab1.id1 as id1, tab1.id2 as id2, tab1.date as date, tab2.price as price
 from tab1
 left join tab2
 on tab1.id1 = tab2.id1
 and tab1.id2 = tab2.id2
 and tab1.date = tab2.date) as t group by t.id1,t.id2,t.date,t.price
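A rough sketch of the same idea against the sample rows from the question. The inline VALUES data, the inner join, and the `is not null` filter are assumptions added for illustration, not part of the answer above. Grouping by every selected column collapses the duplicates, but the row with an empty price would survive as its own group, so some extra filter seems to be needed to reach the two-row result the question asks for.

```sql
-- Sample rows copied from the question; the VALUES CTEs stand in for the real tables.
with tab1 (id1, id2, date, degree) as (
  values
    (1, 10, 20200101, 1),
    (1, 11, 20200101, 1),
    (1, 11, 20200101, 1)
),
tab2 (id1, id2, date, price, rank) as (
  values
    (1, 10, 20200101, 1200, 1),
    (1, 10, 20200101, 1200, 2),
    (1, 10, 20200101, null, null),
    (1, 10, 20200101, 1300, 1),
    (1, 10, 20200101, 1300, 2)
)
select t1.id1, t1.id2, t1.date, t2.price, t1.degree
from tab1 t1
join tab2 t2
  on t1.id1 = t2.id1
 and t1.id2 = t2.id2
 and t1.date = t2.date
where t2.price is not null   -- assumption: rows without a price are not wanted
group by t1.id1, t1.id2, t1.date, t2.price, t1.degree
order by t2.price
```

On these sample rows this should leave one row per distinct price for the key (1, 10, 20200101). Whether dropping empty prices is right for the full data set is something only the data owner can confirm.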

Answer 2

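This involves some guessing about your data, but based on your example, restricting the rank column to the value 1 appears to give the desired result.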

select
  tab1.id1, tab1.id2, tab1.date, tab2.price
from
  tab1
  join tab2 on
    tab1.id1 = tab2.id1 and
    tab1.id2 = tab2.id2 and
    tab1.date = tab2.date and
    tab2.rank = 1 -- add this line

Of course, if this does not hold for the whole data set, it will not work.

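In most cases I prefer to avoid select distinct and its derivatives (including grouping by every column, which is essentially a select distinct), because it feels very arbitrary: it just removes whatever records happen to be identical. Instead, I think it is better to understand your data and know why certain records are being filtered out.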

For example, if you really want the record with the lowest "rank" value, but that value is not guaranteed to always be 1, you could do this:

select distinct on (tab1.id1, tab1.id2, tab1.date)
  tab1.id1, tab1.id2, tab1.date, tab2.price
from
  tab1
  join tab2 on
    tab1.id1 = tab2.id1 and
    tab1.id2 = tab2.id2 and
    tab1.date = tab2.date
order by
  tab1.id1, tab1.id2, tab1.date, tab2.rank

I know I just said that I avoid select distinct, but this is actually a select distinct on, which is quite different: the order by makes it explicit which record is kept and why.
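One caveat: select distinct on is a PostgreSQL extension, and Presto does not accept it. A possible Presto-friendly equivalent, sketched here under assumptions, uses the row_number() window function to keep the lowest-rank row per key. The inline VALUES data and the names `ranked` and `rn` are made up for illustration.

```sql
-- Sample rows copied from the question; the VALUES CTEs stand in for the real tables.
with tab1 (id1, id2, date, degree) as (
  values
    (1, 10, 20200101, 1),
    (1, 11, 20200101, 1)
),
tab2 (id1, id2, date, price, rank) as (
  values
    (1, 10, 20200101, 1200, 1),
    (1, 10, 20200101, 1200, 2),
    (1, 10, 20200101, 1300, 1),
    (1, 10, 20200101, 1300, 2)
),
ranked as (
  select
    t1.id1, t1.id2, t1.date, t2.price,
    -- number the joined rows within each key, lowest rank first
    row_number() over (
      partition by t1.id1, t1.id2, t1.date
      order by t2.rank
    ) as rn
  from tab1 t1
  join tab2 t2
    on t1.id1 = t2.id1
   and t1.id2 = t2.id2
   and t1.date = t2.date
)
select id1, id2, date, price
from ranked
where rn = 1   -- keep one row per (id1, id2, date)
```

Like distinct on, this keeps exactly one row per (id1, id2, date). In the sample data two prices share rank 1 for the same key, so which of them survives is not determined unless a tie-breaker such as price is added to the window's order by.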

sql

postgresql

presto