Deleting rows where any of several columns is a duplicate

Question

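I have a data frame with an ID column and several attribute columns. I want to delete every row in which one (or more) of the attribute columns has the same value as any other attribute column. In other words, I only want to keep rows in which every attribute value is unique within that row.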

For example, this code:

    example = data.frame(id = c("a", "b", "c", "d"), attr1 = seq(1, 4), attr2 = c(2, 3, 3, 1), attr3 = c(1, 2, 3, 3))

produces this data frame:

id  attr1   attr2   attr3
a     1     2       1
b     2     3       2
c     3     3       3
d     4     1       3

I want to delete every row except the last one, the row with ID "d".

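I have been looking for ways to do this, but I'm not sure how to approach this particular problem (uniqueness within a row). If these were columns, it would be easy.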

Thanks in advance!

Answer 1

You can try anyDuplicated:

 example[!apply(example[-1], 1, anyDuplicated),]
 #  id attr1 attr2 attr3
 #4  d     4     1     3
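
To see why the negation works, here is a minimal sketch of the intermediate step. `anyDuplicated` returns 0 for a vector with no duplicates and otherwise the index of the first duplicated element, so `!` turns the per-row result into TRUE only for duplicate-free rows. The names `dup_index` and `keep` are illustrative and not part of the original answer.

```r
example <- data.frame(id = c("a", "b", "c", "d"),
                      attr1 = seq(1, 4),
                      attr2 = c(2, 3, 3, 1),
                      attr3 = c(1, 2, 3, 3))

# For each row: index of the first duplicated attribute, or 0 if there is none
dup_index <- apply(example[-1], 1, anyDuplicated)

# A row is kept only when no duplicate was found in it
keep <- dup_index == 0
example[keep, ]
```

Comparing against 0 explicitly is equivalent to `!dup_index` here and may read more clearly.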

Or

 example[apply(example[-1],1, function(x) length(unique(x))==3),]
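
The `3` above is the number of attribute columns in this particular example. A minimal sketch that avoids hard-coding it, assuming the attribute columns are everything except the first column (the names `attrs`, `n_attrs` and `all_distinct` are illustrative):

```r
example <- data.frame(id = c("a", "b", "c", "d"),
                      attr1 = seq(1, 4),
                      attr2 = c(2, 3, 3, 1),
                      attr3 = c(1, 2, 3, 3))

attrs <- example[-1]      # all columns except the id column
n_attrs <- ncol(attrs)    # number of attribute columns

# TRUE when a row has as many distinct values as it has attributes
all_distinct <- apply(attrs, 1, function(x) length(unique(x)) == n_attrs)
example[all_distinct, ]
```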

Or using a regex (this assumes every attribute value is a single digit):

 example[!nzchar(sub('^(?:([0-9])(?!.*\\1))*$', '',
              do.call(paste0, example[-1]), perl=TRUE)),]
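
The regex approach first pastes each row's attributes into one string and then looks for a repeated character. As a rough sketch of that idea, the snippet below shows the intermediate strings and uses a simpler, swapped-in pattern with `grepl` instead of the `sub`/`nzchar` combination. It relies on the same assumption that every value is a single character, and the names `keys` and `has_repeat` are illustrative.

```r
example <- data.frame(id = c("a", "b", "c", "d"),
                      attr1 = seq(1, 4),
                      attr2 = c(2, 3, 3, 1),
                      attr3 = c(1, 2, 3, 3))

# One string per row, built by concatenating the attribute values
keys <- do.call(paste0, example[-1])

# "(.).*\\1" matches any character that occurs again later in the string;
# the backreference must be written as \\1 inside an R string literal
has_repeat <- grepl("(.).*\\1", keys, perl = TRUE)

example[!has_repeat, ]
```

With multi-digit values the pasted strings become ambiguous (for example, 1 and 12 next to each other look like a repeated "1"), so the `apply`-based versions are the safer choice in that case.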

Benchmarks

example1 <- example[rep(1:nrow(example),1e6),]
system.time(example1[!apply(example1[-1], 1, anyDuplicated),])
#   user  system elapsed 
# 32.953   0.222  33.239 

system.time(example1[apply(example1[-1], 1,
       function(x) length(unique(x))==3),])
#   user  system elapsed 
# 35.409   0.185  35.659 

system.time(example1[!nzchar(sub('^(?:([0-9])(?!.*\\1))*$',
           '', do.call(paste0, example1[-1]), perl=TRUE)),])
# user  system elapsed 
# 10.033   0.020  10.069
