Delete duplicate rows from a table with no unique key

Question

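How can I delete duplicate rows from a table in Postgres 9? The rows are complete duplicates on every field, and there is no individual field that could serve as a unique key, so I can't just GROUP BY columns and use a NOT IN statement.

I'm looking for a single SQL statement, not a solution that requires me to create a temporary table and insert records into it. I know how to do that, but it takes more work to fit into my automated process.

Table definition: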

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

Sample data:

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR

Answer 1

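You can try something like this:

-- Copy one instance of every row into a new table in the same schema
CREATE TABLE discogs.temp AS
SELECT DISTINCT * FROM discogs.releases_labels;
DROP TABLE discogs.releases_labels;
-- RENAME TO takes an unqualified name; the table stays in schema discogs
ALTER TABLE discogs.temp RENAME TO releases_labels;
-- The indexes and the foreign key of the old table have to be recreated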
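
A variation on the same idea, in case you would rather keep the existing indexes and the foreign key than recreate them: stage the distinct rows, empty the original in place, and refill it. This is only a sketch. The table names swap_demo and swap_demo_distinct are hypothetical stand-ins for discogs.releases_labels, and TRUNCATE takes an exclusive lock on the table for the duration of the transaction.

```sql
-- Hypothetical stand-in for discogs.releases_labels, with two identical rows
CREATE TEMP TABLE swap_demo (label text, release_id integer, catno text);
INSERT INTO swap_demo VALUES
    ('Warp Records', 6155, 'WAP 39 CDR'),
    ('Warp Records', 6155, 'WAP 39 CDR');

BEGIN;
-- Stage one copy of each row; the staging table is dropped at COMMIT
CREATE TEMP TABLE swap_demo_distinct ON COMMIT DROP AS
SELECT DISTINCT * FROM swap_demo;
-- Empty the original in place, which keeps its indexes and constraints
TRUNCATE swap_demo;
-- Refill it from the staged rows
INSERT INTO swap_demo SELECT * FROM swap_demo_distinct;
COMMIT;

-- Only one copy of the row should remain
SELECT * FROM swap_demo;
```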

Answer 2

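Because you have no primary key, there is no easy way to tell a duplicated row apart from any other. That is one of the reasons why it is strongly recommended that every table have a primary key (*).

So you are left with only 2 solutions:

Use a temporary table as suggested by Rahul (IMHO the simpler and cleaner way) (**)

Use procedural SQL and a cursor, either from a procedural language such as Python or [insert your preferred language here], or with PL/pgSQL. Something like this (beware, untested):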

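CREATE OR REPLACE FUNCTION deduplicate() RETURNS integer AS $$
DECLARE
 -- Sorting puts identical rows next to each other. FOR UPDATE is added
 -- because WHERE CURRENT OF is not guaranteed to work on a cursor that
 -- uses ORDER BY unless the cursor query is declared FOR UPDATE.
 curs CURSOR FOR SELECT * FROM releases_labels
             ORDER BY label, release_id, catno FOR UPDATE;
 r releases_labels%ROWTYPE;
 old releases_labels%ROWTYPE;  -- previous row seen by the loop
 n integer;                    -- number of rows deleted so far
BEGIN
 n := 0;
 old := NULL;
 FOR rec IN curs LOOP
  r := rec;
  -- A row equal to its predecessor is a duplicate: delete it
  IF r = old THEN
   DELETE FROM releases_labels WHERE CURRENT OF curs;
   n := n + 1;
  END IF;
  old := rec;
 END LOOP;
 RETURN n;
END;
$$ LANGUAGE plpgsql;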

SELECT deduplicate();

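This should delete the duplicate rows and return the number of rows actually deleted. It is not necessarily the most efficient way, but you only touch the rows that need to be deleted, so you do not have to lock the whole table.

(*) Fortunately, PostgreSQL provides a ctid pseudo-column that you can use as a key. If your table contains an oid column, you can use that too, since it never changes.

(**) PostgreSQL's WITH allows you to do that in a single SQL statement.

These two points come from Nick Barnes's answer.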

Answer 3

If you can afford to rewrite the whole table, this is probably the simplest approach:

WITH Deleted AS (
  DELETE FROM discogs.releases_labels
  RETURNING *
)
INSERT INTO discogs.releases_labels
SELECT DISTINCT * FROM Deleted

If you need to target the duplicated records specifically, you can make use of the internal ctid field, which uniquely identifies a row:

DELETE FROM discogs.releases_labels
WHERE ctid NOT IN (
  SELECT MIN(ctid)
  FROM discogs.releases_labels
  GROUP BY label, release_id, catno
)

Be careful with ctid; it changes over time. But you can rely on it staying the same within the scope of a single statement.
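
To see what "changes over time" means in practice, here is a small sketch using a hypothetical scratch table ctid_demo (not part of the question's schema). An UPDATE writes a new row version, so the same logical row normally reports a different ctid afterwards. Rewriting the table, for example with VACUUM FULL, can move rows as well.

```sql
-- Hypothetical scratch table, only for illustration
CREATE TEMP TABLE ctid_demo (id integer, note text);
INSERT INTO ctid_demo VALUES (1, 'a');

-- Physical location of the row before the update
SELECT ctid, * FROM ctid_demo;

-- The update stores a new version of the row at a new location
UPDATE ctid_demo SET note = 'b' WHERE id = 1;

-- Same logical row, but the reported ctid is normally different now
SELECT ctid, * FROM ctid_demo;
```

That is why a ctid should not be stored and reused later as if it were a key.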

Answer 4

Single SQL statement

Here is a solution that deletes the duplicates in place:

DELETE FROM releases_labels r
WHERE  EXISTS (
   SELECT 1
   FROM   releases_labels r1
   WHERE  r1 = r
   AND    r1.ctid < r.ctid
   );

Since there is no unique key, I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of duplicates.

In-order sequence generation
How do I (or can I) SELECT DISTINCT on multiple columns?

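ctid is a system column that is not part of the associated row type, so when the whole row is referenced through the table aliases in the expression r1 = r, only the visible columns are compared (not ctid or other system columns). That is why whole rows can be equal while one ctid is still smaller than the other.

With only a few duplicates, this is also the fastest of all the solutions.
With lots of duplicates, other solutions are faster.

Afterwards, I suggest adding a primary key: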
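
A quick way to convince yourself of this, sketched on a hypothetical scratch table rowcmp_demo that mimics the sample data from the question:

```sql
-- Hypothetical stand-in for releases_labels with two identical rows
CREATE TEMP TABLE rowcmp_demo (label text, release_id integer, catno text);
INSERT INTO rowcmp_demo VALUES
    ('Warp Records', 6155, 'WAP 39 CDR'),
    ('Warp Records', 6155, 'WAP 39 CDR');

-- Pair the two physical rows and compare them as whole rows.
-- rows_equal should be true even though the two ctid values differ,
-- because ctid does not take part in the whole-row comparison.
SELECT r1.ctid AS first_ctid,
       r.ctid  AS second_ctid,
       r1 = r  AS rows_equal
FROM   rowcmp_demo r
JOIN   rowcmp_demo r1 ON r1.ctid < r.ctid;
```

The single row this returns corresponds to the row the DELETE above would remove: the one with the larger ctid.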

ALTER TABLE discogs.releases_labels ADD COLUMN releases_labels_id serial PRIMARY KEY;

Why does it work for NULL values?

This is somewhat surprising. The reason is explained in the chapter Composite Type Comparison in the manual:

The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.

Bold emphasis mine.

Alternative with a second table

I removed that section because a better solution is available.
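
The difference described in that passage can be seen side by side. This is a sketch on a hypothetical scratch table null_demo; the expected results in the comments follow from the quoted manual text.

```sql
-- Two row constructors: the comparison depends on NULL = NULL,
-- so the result is NULL rather than true
SELECT ROW(1, NULL::text) = ROW(1, NULL::text) AS row_constructor_cmp;

-- Hypothetical scratch table with two identical rows containing a NULL
CREATE TEMP TABLE null_demo (a integer, b text);
INSERT INTO null_demo VALUES (1, NULL), (1, NULL);

-- Two composite values (whole-row references): NULL fields count as equal,
-- so the comparison yields true
SELECT x = y AS composite_cmp
FROM   null_demo x
JOIN   null_demo y ON x.ctid < y.ctid;
```

This is why the DELETE with r1 = r also removes duplicates of rows that contain NULL values, whereas a column-by-column comparison with = would skip them.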

Answer 5

Since you will also need to avoid duplicates in the future, you can add a surrogate key and a unique constraint while deduplicating:

-- add surrogate key
ALTER TABLE releases_labels
        ADD column id SERIAL NOT NULL PRIMARY KEY
        ;

-- verify
SELECT * FROM releases_labels;

DELETE FROM releases_labels dd
WHERE EXISTS (SELECT *
        FROM releases_labels x
        WHERE x.label = dd.label
        AND x.release_id = dd.release_id
        AND x.catno = dd.catno
        AND x.id < dd.id
        );

-- verify
SELECT * FROM releases_labels;

-- add unique constraint for the natural key
ALTER TABLE releases_labels
        ADD UNIQUE (label,release_id,catno)
        ;

-- verify
SELECT * FROM releases_labels;
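
Once the unique constraint is in place, later loads can skip rows that already exist instead of failing. The sketch below assumes PostgreSQL 9.5 or later, since ON CONFLICT is not available in earlier 9.x releases, and uses a hypothetical scratch table uniq_demo. Note that the constraint does not stop duplicates in which one of the three columns is NULL, because NULL values are treated as distinct by a unique constraint.

```sql
-- Hypothetical scratch table with the surrogate key and natural-key constraint
CREATE TEMP TABLE uniq_demo (
    id         serial PRIMARY KEY,
    label      text,
    release_id integer,
    catno      text,
    UNIQUE (label, release_id, catno)
);

INSERT INTO uniq_demo (label, release_id, catno)
VALUES ('Warp Records', 6155, 'WAP 39 CDR');

-- Same natural key again: silently skipped instead of raising an error
INSERT INTO uniq_demo (label, release_id, catno)
VALUES ('Warp Records', 6155, 'WAP 39 CDR')
ON CONFLICT (label, release_id, catno) DO NOTHING;

-- Still a single row
SELECT * FROM uniq_demo;
```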
