How to remove duplicates from each column after group_by() using the `dplyr` package

Question

I have a data.frame, mydata, like this:

   V1 V2 V3 V4 V5
1  a  b  a      
2  a  b     c   
3  a  b        d
4  x  y  h      
5  x  y     k  e

I want to group by columns V1 and V2 and remove the "" (empty) strings from the other columns.

The result should look like this:

  V1 V2 V3 V4 V5
1  a  b  a  c  d
2  x  y  h  k  e

Is there an efficient way to do this with the dplyr package? Thank you very much.

Answer 1

We can use dplyr/tidyr. We use gather to reshape the data from 'wide' to 'long' format, filter to remove the blank elements in the 'Val' column, and spread to reshape it back to 'wide' format.

library(dplyr)
library(tidyr) 
gather(mydata, Var, Val, V3:V5) %>% 
              filter(Val!='') %>% 
              spread(Var, Val)
#   V1 V2 V3 V4 V5
#1  a  b  a  c  d
#2  x  y  h  k  e
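In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The same reshape, filter, reshape idea can be sketched with the newer functions as follows. This is only an illustration, assuming the example data from the question and exactly one non-blank value per group and column; the name result is introduced here for the example.

```r
library(dplyr)
library(tidyr)

# Example data from the question
mydata <- data.frame(
  V1 = c("a", "a", "a", "x", "x"),
  V2 = c("b", "b", "b", "y", "y"),
  V3 = c("a", "", "", "h", ""),
  V4 = c("", "c", "", "", "k"),
  V5 = c("", "", "d", "", "e"),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  # wide -> long: one row per (V1, V2, column name, value)
  pivot_longer(V3:V5, names_to = "Var", values_to = "Val") %>%
  # drop the blank strings
  filter(Val != "") %>%
  # long -> wide: one row per (V1, V2) group
  pivot_wider(names_from = Var, values_from = Val)

print(result)
```

If a group had more than one non-blank value in the same column, pivot_wider would return list-columns and warn, so this sketch relies on that assumption holding.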

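Another option that uses only dplyr (if the number of non-blank values is the same in each group) is to group by 'V1', 'V2' and then use summarise_each to select only the elements that are not blank (.[.!='']).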

 mydata %>%
       group_by(V1, V2) %>% 
       summarise_each(funs(.[.!='']))
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e
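summarise_each and funs are deprecated in current dplyr releases; since dplyr 1.0.0 the same grouped summary is usually written with across. The sketch below assumes the example data from the question. The helper first_non_blank is hypothetical (not part of dplyr) and is added only to make the summary safe when a group has no non-blank value, or more than one, in a column.

```r
library(dplyr)

# Example data from the question
mydata <- data.frame(
  V1 = c("a", "a", "a", "x", "x"),
  V2 = c("b", "b", "b", "y", "y"),
  V3 = c("a", "", "", "h", ""),
  V4 = c("", "c", "", "", "k"),
  V5 = c("", "", "d", "", "e"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first non-blank value of x,
# or NA if every value in x is blank
first_non_blank <- function(x) {
  keep <- x[x != ""]
  if (length(keep) > 0) keep[1] else NA_character_
}

result <- mydata %>%
  group_by(V1, V2) %>%
  # apply the helper to each non-grouping column within each group
  summarise(across(V3:V5, first_non_blank), .groups = "drop")

print(result)
```

Taking only the first non-blank value is a design choice for this sketch; if several distinct values per group should be kept, a different summary (for example, pasting them together) would be needed.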

We can also do this with data.table. We convert the 'data.frame' to a 'data.table' (setDT(mydata)), group by 'V1', 'V2', loop over the other columns (lapply(.SD, ...)), and subset the elements that are not blank.

 library(data.table)
 setDT(mydata)[,lapply(.SD, function(x) x[x!='']) ,.(V1, V2)]
 #   V1 V2 V3 V4 V5
 #1:  a  b  a  c  d
 #2:  x  y  h  k  e

A similar approach using aggregate from base R is

 aggregate(.~V1+V2, mydata, FUN=function(x) x[x!=''])
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e

Data

mydata <- structure(list(V1 = c("a", "a", "a", "x", "x"),
V2 = c("b", "b", 
"b", "y", "y"), V3 = c("a", "", "", "h", ""), V4 = c("", "c", 
"", "", "k"), V5 = c("", "", "d", "", "e")), .Names = c("V1", 
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Answer 2

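Using base R, if you are interested. Note that this does not actually group by V1 and V2: it takes the unique non-empty values of each column independently, so it gives the expected result only when every column yields the same number of unique values in matching order, as in this example.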

x <- data.frame(V1 = c(rep("a", 3), "x", "x"), 
    V2 = c(rep("b", 3), "y", "y"), 
    V3= c("a", "", "", "h", ""), 
    V4 = c("", "c", "", "", "k"), 
    V5 = c(rep("", 2), "d", "", "e"))

temp <- lapply(x[], function(y) as.character(unique(y[y != ""])))
data.frame(do.call(cbind,temp))

  V1 V2 V3 V4 V5
1  a  b  a  c  d
2  x  y  h  k  e
