SELECT rows based on distinctiveness of two columns

Question

Suppose we have the following table:

orderId  productId   orderDate              amount    
1        2           2017-01-01 20:00:00    10 
1        2           2017-01-01 20:00:01    10 
1        3           2017-01-01 20:30:10    5 
1        4           2017-01-01 22:31:10    1

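The first two rows are known to be duplicates (for example, the result of buggy software), because orderId + productId must form a unique key.

I want to remove such duplicates. What is the most efficient way to do this?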

If the orderDate values did not differ by one second, we could use

SELECT DISTINCT * FROM `table`

Alternatively, GROUP BY can be used:

SELECT `orderId`,`productId`,MIN(`orderDate`),MIN(`amount`)
FROM `table`
GROUP BY `orderId`,`productId`

With many columns, I find the latter command very tedious. What other options are there?

Update: I am using Snowflake.

Answer 1

If your DBMS supports the ROW_NUMBER window function, then:

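-- Number the rows within each (orderId, productId) group, earliest orderDate first,
-- then keep only the first row of each group.
select *
from (
  select row_number() over (partition by orderId, productId order by orderDate asc) as rn, *
  from yourtable
) a
where rn = 1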
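
Since the question was later updated to mention Snowflake, one small variation may be useful there: the query above also returns the helper column rn, and Snowflake's SELECT * EXCLUDE can drop it again. A minimal sketch, assuming Snowflake and the same table name yourtable as above:

```sql
-- Same ROW_NUMBER approach as above.
-- EXCLUDE (rn) removes the helper column from the output.
-- Assumption: Snowflake dialect (SELECT * EXCLUDE is not standard SQL).
SELECT * EXCLUDE (rn)
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY orderId, productId
                              ORDER BY orderDate ASC) AS rn
    FROM yourtable t
) a
WHERE rn = 1;
```

On other databases the usual workaround is to list the wanted columns explicitly in the outer SELECT.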

Answer 2

You can use NOT EXISTS to exclude records for which a better match exists:

select * from mytable
where not exists
(
  select *
  from mytable other
  where other.orderid   = mytable.orderid
    and other.productid = mytable.productid
    and other.orderdate < mytable.orderdate
);

Answer 3

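It looks as if you want, among the records that share the same orderid and productid, the one with the smallest orderdate value. This can be expressed in SQL as follows: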

select * from mytable t where t.orderdate = 
  (select min(t2.orderdate)
   from mytable t2
   where t2.orderid = t.orderid 
     and t2.productid = t.productid);

Note that this query does not eliminate exact duplicates across the columns orderid, productid, and orderdate; but that was not actually asked for.
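
To make that caveat concrete, here is a small sketch with made-up data in which the two (1,2) rows share the same orderdate. The VALUES syntax and the CTE are only there for illustration and assume Snowflake; the rows themselves are hypothetical.

```sql
-- Hypothetical data: both (1,2) rows carry the same orderDate (a tie).
WITH mytable AS (
    SELECT * FROM VALUES
        (1, 2, '2017-01-01 20:00:00', 10),
        (1, 2, '2017-01-01 20:00:00', 9),
        (1, 3, '2017-01-01 20:30:10', 5)
        v(orderId, productId, orderDate, amount)
)
SELECT *
FROM mytable t
WHERE t.orderDate = (SELECT MIN(t2.orderDate)
                     FROM mytable t2
                     WHERE t2.orderId = t.orderId
                       AND t2.productId = t.productId);
-- Both (1,2) rows satisfy the condition, since each has the minimum orderDate,
-- so the tie is not reduced to a single row.
```

If ties like this can occur in your data, a ROW_NUMBER based filter picks exactly one row per group, although which of the tied rows it picks is arbitrary unless the ORDER BY includes a tie-breaker.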

Answer 4

Pரதீப்'s answer is good, but Snowflake now supports QUALIFY, which lets you avoid the sub-select/WHERE pattern and do everything in a single level, so your SQL can be written as

SELECT * FROM table_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY orderId, productId ORDER BY orderDate) = 1
ORDER BY 1,2;

and, using dummy data loaded with VALUES:

SELECT * FROM VALUES  
    (1, 2, '2017-01-01 20:00:00', 10),
    (1, 2, '2017-01-01 20:00:01', 10),
    (1, 3, '2017-01-01 20:30:10', 5),
    (1, 4, '2017-01-01 22:31:10', 1) 
    t(orderId, productId, orderDate, amount)
QUALIFY ROW_NUMBER() OVER (PARTITION BY orderId, productId ORDER BY orderDate) = 1
ORDER BY 1,2;

we get the desired rows:

ORDERID	PRODUCTID	ORDERDATE	AMOUNT
1	2	2017-01-01 20:00:00	10
1	3	2017-01-01 20:30:10	5
1	4	2017-01-01 22:31:10	1
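
The question asks about removing the duplicates, not only selecting around them. One way to persist the result is to write the filtered rows into a new table. This is a sketch, not a tested migration: orders_deduped is a hypothetical target name, and table_name stands for the source table as above.

```sql
-- Materialize the deduplicated rows into a new table.
-- orders_deduped is a hypothetical name; adjust to your schema.
CREATE OR REPLACE TABLE orders_deduped AS
SELECT *
FROM table_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY orderId, productId
                           ORDER BY orderDate) = 1;
```

Writing to a separate table first lets you compare row counts against the source before swapping or dropping anything.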

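What not to do:

I saw GROUP BY/MIN suggested in the comments, but that gives the minimum of each column independently rather than taking all values from the matching row. In the modified example below, the first row is the earliest (1,2) row, but the minimum amount is 9, which comes from the other row.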

SELECT orderId, productId, min(orderDate), min(amount) FROM VALUES  
    (1, 2, '2017-01-01 20:00:00', 10),
    (1, 2, '2017-01-01 20:00:01', 9),
    (1, 3, '2017-01-01 20:30:10', 5),
    (1, 4, '2017-01-01 22:31:10', 1) 
    t(orderId, productId, orderDate, amount)
GROUP BY 1,2
ORDER BY 1,2;

ORDERID	PRODUCTID	ORDERDATE	AMOUNT
1	2	2017-01-01 20:00:00	9
1	3	2017-01-01 20:30:10	5
1	4	2017-01-01 22:31:10	1
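
If an aggregate form is still preferred, Snowflake's MIN_BY can tie each value to the row with the earliest orderDate instead of taking an independent minimum. A sketch on the same modified data; the aliases first_orderDate and first_amount are made up for this example:

```sql
-- MIN_BY(amount, orderDate) returns the amount from the row whose orderDate
-- is smallest within each group, rather than the smallest amount overall.
SELECT orderId, productId,
       MIN(orderDate)            AS first_orderDate,
       MIN_BY(amount, orderDate) AS first_amount
FROM VALUES
    (1, 2, '2017-01-01 20:00:00', 10),
    (1, 2, '2017-01-01 20:00:01', 9),
    (1, 3, '2017-01-01 20:30:10', 5),
    (1, 4, '2017-01-01 22:31:10', 1)
    t(orderId, productId, orderDate, amount)
GROUP BY 1,2
ORDER BY 1,2;
```

For the (1,2) group this should give amount 10, taken from the 20:00:00 row. Every non-key column still needs its own MIN_BY, so with many columns QUALIFY remains the less tedious option.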