Query to remove "duplicate" rows using regexp

Question

I am using PostgreSQL. I have a table keywords:

# Table name: keywords
#
#  id         :integer not null, primary key
#  text       :string  not null
#  match_type :string  not null
#  adgroup_id :integer not null

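The table has a unique index USING btree (match_type, adgroup_id, text)

Now, the problem is that for the same adgroup_id and match_type there are texts like "Hello" and " Hello" or "Hello " or " Hello " (note the leading/trailing whitespace). Because the text column contains those spaces at the beginning and end of the string, the table holds bad data: without that whitespace, these rows would have been rejected by the unique index.

I plan to add whitespace trimming before insert later on, but first I need to clean up the data.

How can I delete the "duplicate" rows and keep only the unique ones (based on comparing the strings without leading and trailing spaces)?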

Answer 1

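Here is one option, using a CTE. The CTE looks for (match_type, adgroup_id) groups that have two or more text values which are identical once leading and trailing whitespace is trimmed. We also compute the following:

cnt - for each group, the number of times the "pure" version of the text appears. "Pure" here means the text with no leading or trailing whitespace
rn - an arbitrary row number within each such group, starting from 1

Then we delete a row only if it appears in a duplicate group and either it is not the pure version of the text while one exists (cnt > 0), or its arbitrary row number is greater than 1. This means that for the case of "Hello " and " Hello", one of the two records is deleted arbitrarily. However, if there is a third, "pure" record "Hello", then it is kept and the first two are both deleted.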

with cte as (
    select id, text,
        -- total: number of rows sharing the same trimmed text
        count(*) over w as total,
        -- cnt: number of rows in the group that are already "pure"
        count(case when text = trim(text) then 1 end) over w as cnt,
        -- rn: arbitrary row number within the group, starting from 1
        row_number() over (w order by id) as rn
    from keywords
    window w as (partition by match_type, adgroup_id, trim(text))
)

delete
from keywords k1
using cte k2
where k1.id = k2.id and
      k2.total > 1 and                                  -- only duplicate groups
      ((k2.cnt > 0 and k2.text <> trim(k2.text)) or     -- pure row exists: drop the padded ones
       (k2.cnt = 0 and k2.rn > 1));                     -- no pure row: keep one arbitrary row
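
Before running a delete like this, it may help to look at what would be affected. The following is only a sketch of such a preview, not part of the original answer; the column aliases trimmed, n and ids are made up for illustration. It lists every group of rows that collapse to the same trimmed text, together with the ids involved.

```sql
-- Preview: groups whose text values become identical after trimming.
SELECT match_type,
       adgroup_id,
       trim(text)               AS trimmed,  -- the "pure" form of the text
       count(*)                 AS n,        -- how many rows collapse to it
       array_agg(id ORDER BY id) AS ids      -- the rows involved
FROM keywords
GROUP BY match_type, adgroup_id, trim(text)
HAVING count(*) > 1
ORDER BY match_type, adgroup_id, trimmed;
```

Running the delete inside a transaction (BEGIN ... then COMMIT or ROLLBACK) and re-running this preview before committing is one way to confirm that no duplicate groups remain.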

Answer 2

demo:db<>dbfiddle (the example contains two groups: "Hello" has no element without spaces; "Bye" contains two elements without spaces)

DELETE FROM keywords
WHERE id NOT IN (
    SELECT DISTINCT ON (trim(text))                 --1
        id
    FROM
        keywords
    ORDER BY 
        trim(text), 
        text = trim(text) DESC                   --2
)

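1. Group by the trimmed text.
2. Order by the trimmed text and by whether the text is already free of surrounding spaces. If such an element exists, it sorts first and is the one picked by the DISTINCT ON clause. If there is none, another element is taken.

Solution including the additional columns: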
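
To see how the ordering decides which row survives, here is a small standalone sketch that runs the same DISTINCT ON against inline sample values instead of the real table. The sample rows and the alias t are made up, and the extra id tie-breaker is an addition so that the result is deterministic when no space-free row exists.

```sql
-- Sample data: 'Bye' has a space-free row, 'Hello' does not.
SELECT DISTINCT ON (trim(text))
    id, text
FROM (VALUES
        (1, ' Hello'),
        (2, 'Hello '),
        (3, 'Bye'),
        (4, ' Bye ')
     ) AS t(id, text)
ORDER BY
    trim(text),
    text = trim(text) DESC,   -- true sorts before false, so clean rows come first
    id;                       -- assumption: lowest id wins among padded rows
-- keeps id 3 ('Bye') and id 1 (' Hello')
```

The ids returned here are the ones the NOT IN subquery would protect; every other row would be deleted.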

    DELETE FROM keywords
    WHERE id NOT IN (
        SELECT DISTINCT ON (match_type, adgroup_id, trim(text))
            id
        FROM
            keywords
        ORDER BY 
            match_type,
            adgroup_id,
            trim(text), 
            text = trim(text) DESC
    )
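
Either answer can leave behind a row that still carries surrounding spaces, namely when no clean version existed in its group. A follow-up along these lines could normalize the survivors and guard future inserts, which is what the question plans to do. This is a sketch under assumptions: the constraint name keywords_text_trimmed is made up, and trim() without further arguments strips only space characters, so tabs or newlines would need something like regexp_replace(text, '^\s+|\s+$', '', 'g') instead.

```sql
BEGIN;

-- Run only after the duplicates are gone; otherwise this UPDATE can
-- collide with the unique index on (match_type, adgroup_id, text).
UPDATE keywords
SET text = trim(text)
WHERE text <> trim(text);

-- Hypothetical constraint name; rejects padded text from now on.
ALTER TABLE keywords
    ADD CONSTRAINT keywords_text_trimmed CHECK (text = trim(text));

COMMIT;
```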
