Duplicate email addresses with ID column

Question

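My table contains duplicate email addresses. Each email address has a unique create date and a unique ID. For each email address, I want to identify the row with the most recent create date and its associated ID, and also show the duplicate IDs with their create dates. I would like the query to display the results in the following column format: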

Column 1: EmailAddress
Column 2: IDKeep
Column 3: CreateDateofIDKeep
Column 4: DuplicateID
Column 5: CreateDateofDuplicateID

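Note: in some cases there are more than 2 duplicate email addresses. I would like the query to show each additional duplicate on a new row, restating the EmailAddress and IDKeep in those instances.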

I have tried piecing together different queries found here, to no avail. I am currently at a loss, so any help or direction would be greatly appreciated.

Answer 1

Complex queries are best solved by breaking them into parts and working through them step by step.

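First, let's create a query that finds the key of the row we want to keep, by finding the most recent create date for each email and then joining back to the table to get the ID: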

select x.Email, x.CreateDate, x.Id
from myTable x
join (
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) y on x.Email = y.Email and x.CreateDate = y.CreateDate
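
To check this step in isolation, here is a minimal sketch that runs the query against an in-memory SQLite database from Python. The table and column names follow the answer. The sample rows, the ISO-formatted date strings, and the added `order by` (only there to make the output deterministic) are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data: names follow the answer, the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table myTable (Id integer primary key, Email text, CreateDate text)"
)
conn.executemany(
    "insert into myTable (Id, Email, CreateDate) values (?, ?, ?)",
    [
        (1, "a@example.com", "2020-01-01"),
        (2, "a@example.com", "2020-03-01"),
        (3, "a@example.com", "2020-02-01"),
        (4, "b@example.com", "2020-01-15"),
    ],
)

# The "row to keep" query: latest CreateDate per Email, joined back for the Id.
keep_sql = """
select x.Email, x.CreateDate, x.Id
from myTable x
join (
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) y on x.Email = y.Email and x.CreateDate = y.CreateDate
order by x.Email
"""
keep_rows = conn.execute(keep_sql).fetchall()
for row in keep_rows:
    print(row)
```

Note that this step returns one row for every email, including those that have no duplicates; the restriction to duplicated emails comes from a separate query.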

OK, now let's write a query to get the duplicated email addresses:

select Email
from myTable
group by Email
having count(*) > 1

And join this query back to the table to get the key of every row that has duplicates:

select x.Email, x.Id, x.CreateDate
from myTable x
join (
    select Email
    from myTable
    group by Email
    having count(*) > 1
) y on x.Email = y.Email

Great. Now all that remains is to join the first query with this one to get our result:

select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id

Note that the final keep.Id <> dup.Id predicate on the join ensures that we don't get the same row for both keep and dup.
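
As a quick sanity check of the combined query, the sketch below runs it on an in-memory SQLite database from Python. The sample rows are made up, and the trailing `order by` is an addition that only serves to make the output order predictable; everything else is the query from this answer.

```python
import sqlite3

# Hypothetical sample data: three rows share one email, one email is unique.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table myTable (Id integer primary key, Email text, CreateDate text)"
)
conn.executemany(
    "insert into myTable (Id, Email, CreateDate) values (?, ?, ?)",
    [
        (1, "a@example.com", "2020-01-01"),
        (2, "a@example.com", "2020-03-01"),
        (3, "a@example.com", "2020-02-01"),
        (4, "b@example.com", "2020-01-15"),
    ],
)

full_sql = """
select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id
order by keep.Email, dup.Id
"""
result = conn.execute(full_sql).fetchall()
for row in result:
    print(row)
```

With this data, the email that has three rows yields two output rows, each restating the kept ID, and the email without duplicates does not appear at all.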

Answer 2

The following subquery uses a trick to get the latest ID and create date for each email:

select Email, max(CreateDate) as CreateDate,
       substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
from myTable
group by Email
having count(*) > 1;

The having clause also ensures that this only applies to duplicated emails.
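
To make the trick easier to follow, here is a rough Python emulation of what the `substring_index(group_concat(...), ',', 1)` expression computes per group. It is an illustration of the logic only, not how MySQL executes it, and the sample rows and variable names are made up.

```python
from collections import defaultdict

# Hypothetical rows as (Id, Email, CreateDate); dates are ISO strings.
rows = [
    (1, "a@example.com", "2020-01-01"),
    (2, "a@example.com", "2020-03-01"),
    (3, "a@example.com", "2020-02-01"),
    (4, "b@example.com", "2020-01-15"),
]

# group by Email
by_email = defaultdict(list)
for id_, email, created in rows:
    by_email[email].append((created, id_))

latest = {}
for email, entries in by_email.items():
    if len(entries) <= 1:  # having count(*) > 1
        continue
    entries.sort(reverse=True)  # order by CreateDate desc
    concatenated = ",".join(str(id_) for _, id_ in entries)  # group_concat
    first_id = concatenated.split(",", 1)[0]  # substring_index(..., ',', 1)
    latest[email] = (entries[0][0], first_id)  # (max(CreateDate), id)

print(latest)
```

As in the SQL version, the resulting id is a string rather than a number, because it is cut out of the concatenated list.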

This query then just needs to be combined with the rest of the data to get the format you want:

select t.Email, tkeep.id as keep_id, tkeep.CreateDate as keep_date,
       t.id as dup_id, t.CreateDate as dup_CreateDate
from myTable t join
     (select Email, max(CreateDate) as CreateDate,
             substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
      from myTable
      group by Email
      having count(*) > 1
     ) tkeep
     on t.Email = tkeep.Email and t.CreateDate <> tkeep.CreateDate;
