PostgreSQL remove duplicates by GROUP BY

I want to print the latest message for each person, with only one row per person. I am using PostgreSQL 10.

+-----------+----------+--------------+
| name      |   body   |  created_at  |
+-----------+----------+--------------+
| Maria     | Test3    |  2017-07-07  |
| Paul      | Test5    |  2017-06-01  |
+-----------+----------+--------------+

I tried the SQL query below. It returns the right rows, but unfortunately each person appears more than once.

SELECT * FROM messages 
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC

+-----------+----------+--------------+
| name      |   body   |  created_at  |
+-----------+----------+--------------+
| Maria     | Test1    |  2016-06-01  |
| Maria     | Test2    |  2016-11-01  |
| Maria     | Test3    |  2017-07-07  |
| Paul      | Test4    |  2017-01-01  |
| Paul      | Test5    |  2017-06-01  |
+-----------+----------+--------------+

I tried to remove the duplicates with DISTINCT, but unfortunately I get this error message:

SELECT DISTINCT ON (name) * FROM messages 
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (name) * FROM messages
                            ^

Do you have any idea how to solve this?

You would use DISTINCT ON as follows:

SELECT DISTINCT ON (name) * 
FROM messages 
WHERE receive = 't'
ORDER BY name, created_at DESC

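That is:

  • no GROUP BY clause is needed

  • the column(s) listed in DISTINCT ON (...) must appear first in the ORDER BY clause

  • ...followed by the column that decides which row of each group is kept (here, created_at)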

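Note that the result of a DISTINCT ON query is always sorted by the columns in the DISTINCT ON clause (since this sort is what determines which rows are kept).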
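As a minimal sketch of the points above, the script below loads the sample rows from the question into a temporary table and runs the DISTINCT ON query, then wraps it in a subquery so the final result can be re-sorted by created_at. The column types and the use of a TEMP table are assumptions, since the question does not show the schema.

```sql
-- Assumed schema: the question does not show column types.
-- TEMP keeps the demo table local to the current session.
CREATE TEMP TABLE messages (
    name       text,
    body       text,
    created_at date,
    receive    boolean
);

INSERT INTO messages (name, body, created_at, receive) VALUES
    ('Maria', 'Test1', '2016-06-01', true),
    ('Maria', 'Test2', '2016-11-01', true),
    ('Maria', 'Test3', '2017-07-07', true),
    ('Paul',  'Test4', '2017-01-01', true),
    ('Paul',  'Test5', '2017-06-01', true);

-- One row per name: the first row of each name group in ORDER BY order,
-- i.e. the one with the newest created_at.
SELECT DISTINCT ON (name) name, body, created_at
FROM messages
WHERE receive = 't'
ORDER BY name, created_at DESC;
-- Maria | Test3 | 2017-07-07
-- Paul  | Test5 | 2017-06-01

-- Same rows, re-sorted by date in an outer query.
SELECT name, body, created_at
FROM (
    SELECT DISTINCT ON (name) *
    FROM messages
    WHERE receive = 't'
    ORDER BY name, created_at DESC
) latest
ORDER BY created_at DESC;
```

With this sample data the two orderings happen to coincide (Maria's latest message is also the newest overall), so the outer ORDER BY only makes a visible difference on other data.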

If you want more control over the sort order, you can use a window function instead:

SELECT *
FROM (
    SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
    FROM messages m
    WHERE receive = 't'
) t
WHERE rn = 1
ORDER BY created_at DESC

Use DISTINCT ON, but with the correct ORDER BY:

SELECT DISTINCT ON (name) m.*
FROM messages m
WHERE receive = 't'
ORDER BY name, created_at DESC;

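In general, you do not use DISTINCT ON together with GROUP BY. It is used with ORDER BY. It works by picking the first row for each name according to the ORDER BY clause.

You should not think of what you are doing as aggregation. You want to filter based on created_at. In many databases, you would express this with a correlated subquery: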
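To illustrate the "first row per name" rule, here is a small self-contained sketch that uses the question's sample rows as an inline VALUES list (the column types and the CTE are assumptions for the demo). Flipping the sort direction of created_at changes which row counts as first, so the same query shape returns the oldest message per name instead of the newest.

```sql
-- Inline sample data; in practice this would be the real messages table.
WITH messages (name, body, created_at, receive) AS (
    VALUES
        ('Maria', 'Test1', DATE '2016-06-01', true),
        ('Maria', 'Test2', DATE '2016-11-01', true),
        ('Maria', 'Test3', DATE '2017-07-07', true),
        ('Paul',  'Test4', DATE '2017-01-01', true),
        ('Paul',  'Test5', DATE '2017-06-01', true)
)
SELECT DISTINCT ON (name) name, body, created_at
FROM messages
WHERE receive
ORDER BY name, created_at ASC;  -- ASC: the oldest row of each name comes first
-- Maria | Test1 | 2016-06-01
-- Paul  | Test4 | 2017-01-01
```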

select m.*
from messages m
where m.created_at = (select max(m2.created_at)
                      from messages m2
                      where m2.name = m.name and m2.receive = 't'
                     ) and
      m.receive = 't';   -- this condition is probably not needed

SELECT * 
FROM messages 
WHERE receive = 't' and not exists (
    select 1
    from messages m
    where m.receive = messages.receive and messages.name = m.name and m.created_at > messages.created_at
)
ORDER BY created_at DESC

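The query above finds the messages that satisfy the following conditions:

  • receive is 't'
  • no other message exists
    • with the same receive value
    • with the same name
    • and that is newer

Assuming no two messages with the same name were sent at the same time, this should be enough. Another point is that names may look identical but actually differ if the values contain whitespace characters. So if the query above returns two records that appear to have the same name but different created_at values, it is very likely that whitespace is playing tricks on you.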
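Both assumptions can be checked directly. The queries below are a hedged sketch against the question's messages table: the first lists names that have more than one message with the same created_at, the second lists names with leading or trailing whitespace, and the third shows one possible way to group on the trimmed name. Note that btrim with a single argument strips spaces only; the explicit character list used here is an assumption about which whitespace characters might be present.

```sql
-- 1. Ties: names with several messages sharing the same created_at.
SELECT name, created_at, count(*) AS message_count
FROM messages
WHERE receive = 't'
GROUP BY name, created_at
HAVING count(*) > 1;

-- 2. Names that carry leading or trailing whitespace
--    (space, tab, newline, carriage return).
SELECT name,
       length(name)                       AS raw_length,
       length(btrim(name, E' \t\n\r'))    AS trimmed_length
FROM messages
WHERE name <> btrim(name, E' \t\n\r');

-- 3. Latest message per trimmed name, so 'Maria' and 'Maria ' form one group.
SELECT DISTINCT ON (btrim(name, E' \t\n\r'))
       btrim(name, E' \t\n\r') AS name, body, created_at
FROM messages
WHERE receive = 't'
ORDER BY btrim(name, E' \t\n\r'), created_at DESC;
```

If the first query returns rows, the NOT EXISTS approach keeps every tied row for that name, while DISTINCT ON and ROW_NUMBER() keep an arbitrary one of them unless the ORDER BY is extended with a unique tie-breaker column.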