Collapse data with rules for a list of data frames

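I have a data frame that needs to be collapsed by defined groups. The data consists of several hundred groups, each with 2-5 rows. For simplicity, my example shows 3 groups with 2-4 rows each.

I want to flatten the replicates within each group. For each column in a group, I want to return the most frequently occurring value that is not NA. The problem is what to do in the case of a tie. For ties, I need to set custom rules based on the types of the tied values. A possible last-resort option would be to paste the tied values together, separated by commas, so that I can handle them afterwards with find/replace.

To get the most frequently occurring value, I can use the max function on the counts. Any suggestions on how to handle ties?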
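A minimal sketch of that idea, in base R only (the vector `x` and the helper name `most_common` are made up for illustration): `table()` counts the non-NA values, and comparing the counts against their `max()` returns every value that shares the top count, so a tie shows up as more than one name.

```r
# Hypothetical helper: return every value that shares the highest count.
most_common <- function(x) {
  counts <- table(x)                      # table() drops NA by default
  names(counts)[counts == max(counts)]    # one name, or several on a tie
}

x <- c("A/B", "A/B", "A/A", "A/A", NA)    # made-up column with a tie
most_common(x)                            # two names come back, so this is a tie
paste(most_common(x), collapse = ",")     # fallback: join tied values with commas
```

This sketch assumes the vector holds at least one non-NA value; an all-NA vector would need a separate check.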

#Input Data Example
> data
   Group Loc1 Loc2 Loc3 Loc4
1 Group1  A/B  A/A  B/B   NA
2 Group1  A/B  A/A  B/B  A/A
3 Group1  A/A  A/A  A/A   NA
4 Group1  A/A  A/A  A/A   NA
5 Group2  A/A   NA  C/C  B/B
6 Group2  B/B  A/A  C/C  B/B
7 Group2  B/B  A/A  C/C  B/B
8 Group3  B/B  B/B   NA  B/B
9 Group3  B/B  B/B   NA  A/A

#Desired Collapsed Output
> data.collapsed
   Group Loc1 Loc2 Loc3 Loc4
1 Group1   NA  A/A  A/B  A/A
2 Group2  B/B  A/A  C/C  B/B
3 Group3  B/B  B/B   NA  A/B

Final code (updated January 27, 2015)

library(data.table)
#Data Frame
#Each group has replicates of data that need to be collapsed to make a consensus data replicate
data = rbind(c("Group1","A/B", "A/A","B/B",NA), c("Group1","A/B", "A/A","B/B","A/A"), c("Group1","A/A", "A/A","A/A",NA),
         c("Group1","A/A", "A/A","A/A",NA), c("Group2","A/A", NA,"C/C","B/B"), c("Group2","B/B", "A/A","C/C","B/B"), 
         c("Group2","B/B", "A/A","C/C","B/B"), c("Group3","B/B", "B/B",NA,"B/B"), c("Group3","B/B", "B/B",NA,"A/A"))
colnames(data) = c("Group", "Loc1", "Loc2", "Loc3", "Loc4")
data = as.data.frame(data)
data

#Define acceptable value types; these could be used to define what to do in the case of a tie
same.letter = c("A/A","B/B","C/C")
diff.letter = c("A/B","A/C","B/C")

#Function for collapsing data with rules
RepMerge = function(col) {
  z = table(col)                                    # counts of non-NA values (table() drops NA by default)
  if (length(z) == 0 || max(z) == 0) return("NA")   # every value in the group is NA
  top = names(z)[z == max(z)]                       # value(s) with the highest count

  if (length(top) == 1) {
    top                                             # one max value: report that value
  } else if (length(top) > 2) {
    "NA"                                            # tied between more than 2 different values: report NA
  } else if (all(top %in% same.letter)) {
    paste(substring(top[1], 1, 1), substring(top[2], 1, 1), sep = "/")  # both tied values are in 'same.letter': report a combination
  } else if (any(top %in% diff.letter)) {
    "NA"                                            # one of the tied values is in 'diff.letter': report NA
  } else {
    "Check Code"                                    # none of the cases above fit
  }
}

setDT(data)[,lapply(.SD,RepMerge),Group] # run function to collapse the data

Thanks, SC2

Here is a data.table-based solution:

library(data.table)
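setDT(data)[, lapply(.SD, function(cl) {
  z <- table(cl)                 # count each value in the column
  z.max <- which(z == max(z))    # position(s) of the highest count
  ifelse(length(z.max) > 1, "NA", names(z)[z.max])  # tie -> "NA", otherwise the winner
}), Group]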

#    Group Loc1 Loc2 Loc3 Loc4
#1: Group1   NA  A/A   NA   NA
#2: Group2  B/B  A/A  C/C  B/B
#3: Group3  B/B  B/B   NA   NA

By modifying the ifelse, you can set whatever rules you need for handling ties and NAs.
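For instance, one way to adapt it is to pull the anonymous function out and hand ties to a separate rule function. This is a sketch only: `collapse_col` and `tie_rule` are hypothetical names, not part of the answer above. Here the default tie rule just joins the tied values with commas, and a group with no non-NA values returns a real `NA`:

```r
# Hypothetical refactoring of the anonymous function; all names are illustrative.
collapse_col <- function(cl, tie_rule = function(v) paste(v, collapse = ",")) {
  z <- table(cl)                               # counts of non-NA values
  if (length(z) == 0) return(NA_character_)    # nothing but NA in this group
  top <- names(z)[z == max(z)]                 # value(s) with the highest count
  if (length(top) > 1) tie_rule(top) else top  # apply the custom rule only on a tie
}

collapse_col(c("B/B", "A/A"))                  # tie, joined by the default rule
collapse_col(c("A/A", "B/B", "B/B"))           # single most frequent value
collapse_col(c(NA, NA))                        # no usable values
```

It would then be called the same way as before, `setDT(data)[, lapply(.SD, collapse_col), Group]`, assuming the columns are character vectors rather than factors (a factor keeps zero-count levels in `table()`, which would look like a tie).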

PS: You were wondering why the max function ignores NA in your code. This happens because your data table contains the string 'NA', not actual NAs.
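The distinction between the string and a real missing value can be checked in base R. This is a small illustration with made-up vectors: `table()` leaves out real `NA` values by default, but counts the string "NA" like any other value.

```r
real_na   <- c("A/A", NA, NA)        # real missing values
string_na <- c("A/A", "NA", "NA")    # the two-character string "NA"

table(real_na)                        # only A/A is counted
table(string_na)                      # "NA" is counted twice

is.na(string_na)                      # all FALSE: these are ordinary strings
string_na[string_na == "NA"] <- NA    # convert the strings to real NA
table(string_na)                      # now matches table(real_na)
```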