Trouble Finding IDs with Duplicate Fields

Question

My data looks like this:

ID  Email
1   someone@hotmail.com
2   someone1@hotmail.com
3   someone2@hotmail.com
4   someone3@hotmail.com
5   someone4@hotmail.com
6   someone5@hotmail.com

Each ID should have exactly 1 email address, but that is not the case.

> dim(data)
[1] 5071    2
> length(unique(data$Person_Onyx_Id))
[1] 5071
> length((data$Email))
[1] 5071
> length(unique(data$Email))
[1] 4481

So I need to find the IDs that have duplicate email addresses.

This seems like it should be easy, but I'm striking out:

> sqldf("select ID, count(Email) from data  group by ID having count(Email) > 1")
[1] ID count(Email)  
<0 rows> (or 0-length row.names)

I also tried dropping the having clause, sending the result to an object, and sorting that object by count(Email)... it looks like every ID has a count(Email) of 1...

I would dput the actual data, but I can't because the email addresses are sensitive.

Answer 1

My guess is that you have NULL emails. You can find them by using count(*) instead of count(email):

select ID, count(*)
from data
group by ID
having count(*) > 1;
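
If you want to check that guess from the R side first, here is a rough sketch in base R. As far as I know, sqldf hands R's NA to SQLite as NULL, and count(Email) skips NULLs while count(*) counts every row. The `toy` data frame and its values are made up for illustration; swap in your own data frame and column names.

```r
# Toy stand-in for the real data (hypothetical values, not the asker's).
toy <- data.frame(
  ID    = c(1, 2, 2, 3),
  Email = c("a@example.com", "b@example.com", NA, NA),
  stringsAsFactors = FALSE
)

# How many emails are missing altogether?
n_missing <- sum(is.na(toy$Email))

# Rows per ID: the base-R analogue of count(*).
rows_per_id <- tapply(toy$Email, toy$ID, length)

# Non-missing emails per ID: the analogue of count(Email).
emails_per_id <- tapply(!is.na(toy$Email), toy$ID, sum)

# IDs that have more rows than non-missing emails.
ids_with_gaps <- names(rows_per_id)[rows_per_id > emails_per_id]
```

If `n_missing` comes back as 0 on the real data, missing emails are not the explanation and the two counts will agree for every ID.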

Answer 2

Are you also sure you don't have the opposite situation, where multiple IDs share the same email?

select Email, count(*)
from data
group by Email
having count(*) > 1;
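
The numbers in the question point this way: 5071 rows and 5071 unique IDs mean every ID appears exactly once, while only 4481 unique emails means some emails are repeated across IDs (or are missing). To get the IDs themselves rather than just the repeated emails, something like the following base R sketch might do. The `toy` data frame is invented for illustration and assumes columns named ID and Email.

```r
# Toy stand-in for the real data (hypothetical values).
toy <- data.frame(
  ID    = 1:5,
  Email = c("a@example.com", "b@example.com", "a@example.com",
            "c@example.com", "b@example.com"),
  stringsAsFactors = FALSE
)

# Emails that occur more than once.
dup_emails <- unique(toy$Email[duplicated(toy$Email)])

# Every row (and so every ID) that carries one of those emails,
# sorted so IDs sharing an email sit next to each other.
shared <- toy[toy$Email %in% dup_emails, ]
shared <- shared[order(shared$Email, shared$ID), ]
```

Here `shared` holds IDs 1 and 3 (same email) followed by IDs 2 and 5. Note that `duplicated()` treats repeated NA values as duplicates too, so you may want to filter those out first if missing emails turn up.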

Tags: sql, r, sqldf