Getting a summary data frame for all the combinations of categories represented in two columns

Question

I am working with a data frame corresponding to the example below:

set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
                  "SomeVal" = runif(12))

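I would like to quickly build a data frame containing the summed values for every combination of the categories in CatA and CatNum, as well as for the categories of each column taken separately. In the primitive example above, the first few combinations can be obtained with simple code: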

df_sums <- data.frame(
  "Category" = c("Total for A",
                 "Total for A and 1",
                 "Total for A and 2"),
  "Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)

This produces an informative data frame of sums:

           Category       Sum
1       Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941

This solution would be grossly inefficient when applied to a data frame with many categories. I would like to achieve the following:

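Cycle through all the categories, including those derived from each column separately and those derived from both columns at the same time
Keep some flexibility in how the function is applied, for instance I may want to apply mean instead of sum
Save the "Total for" string as a separate object that I can change when applying a function other than sum.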

I initially thought of using dplyr, along these lines:

require(dplyr)
df_sums_experiment <- dta %>%
  group_by(CatA, CatNum) %>%
  summarise(TotVal = sum(SomeVal))

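But it is not clear to me how to apply multiple groupings at the same time. As stated, I am interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column indicating what is combined and in what order.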

Answer 1

Split the data, then use lapply:

#result
res <- do.call(rbind,
               lapply(
                 c(split(dta,dta$CatA),
                   split(dta,dta$CatNum),
                   split(dta,dta[,1:2])),
                 function(i)sum(i[,"SomeVal"])))

#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
                   Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))

res1
#             Category       Sum
# 1        Total for A 2.1801780
# 2        Total for B 1.4405782
# 3        Total for C 2.2769138
# 4        Total for 1 2.8198078
# 5        Total for 2 3.0778622
# 6  Total for A and 1 1.2101839
# 7  Total for B and 1 0.4076565
# 8  Total for C and 1 1.2019674
# 9  Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
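
The question also asks for a swappable summary function and for the label prefix to live in its own object. As a rough sketch only, the same split-based idea could be wrapped in a small base R function. The helper name summarise_groups and its arguments fun and prefix are hypothetical names introduced here for illustration; they are not part of the answer above.

```r
set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

# Hypothetical helper (not from the original answer): summarise `value`
# within each column of `cols` separately and within their combination.
summarise_groups <- function(data, cols, value,
                             fun = sum, prefix = "Total for ") {
  # One set of pieces per single grouping column
  pieces <- lapply(unname(cols), function(col) split(data[[value]], data[[col]]))
  # One more set for the combination of all grouping columns;
  # `sep` controls how the combined labels are joined
  pieces <- c(pieces, list(split(data[[value]], data[cols], sep = " and ")))
  # Flatten to a single named list of numeric vectors
  pieces <- unlist(pieces, recursive = FALSE)
  data.frame(Category = paste0(prefix, names(pieces)),
             Value = vapply(pieces, fun, numeric(1)),
             row.names = NULL)
}

# Sums with the default prefix
totals <- summarise_groups(dta, c("CatA", "CatNum"), "SomeVal")

# Means with a different prefix
means <- summarise_groups(dta, c("CatA", "CatNum"), "SomeVal",
                          fun = mean, prefix = "Mean for ")
```

Passing sep = " and " to split() builds the combined labels directly, so the sub() step is not needed in this variant.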

Answer 2

You can use tidyr to unite the columns and gather the data, then summarise with dplyr:

library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
        gather(key, val, -SomeVal)  %>%
        group_by(val) %>%
        summarise(sum(SomeVal))

     val sum(SomeVal)
   (chr)        (dbl)
1      1    2.8198078
2      2    3.0778622
3      A    2.1801780
4    A_1    1.2101839
5    A_2    0.9699941
6      B    1.4405782
7    B_1    0.4076565
8    B_2    1.0329217
9      C    2.2769138
10   C_1    1.2019674
11   C_2    1.0749464
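
Current tidyr documentation marks gather() as superseded by pivot_longer(), so a variant of the same idea might look like the sketch below. It assumes tidyr 1.0.0 or later. The column names Both, Source and Group, and the " and " separator, are illustrative choices made here, not part of the answer above. CatNum is converted to character first so that all stacked columns share one type.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

prefix <- "Total for"  # kept separate so it can be changed along with the function

res <- dta %>%
  # make both grouping columns character so they can be stacked together
  mutate(CatNum = as.character(CatNum)) %>%
  # add a combined label while keeping the original columns
  unite("Both", CatA, CatNum, sep = " and ", remove = FALSE) %>%
  # stack the three label columns into one
  pivot_longer(c(CatA, CatNum, Both),
               names_to = "Source", values_to = "Group") %>%
  group_by(Group) %>%
  summarise(Sum = sum(SomeVal)) %>%
  mutate(Category = paste(prefix, Group))
```

Replacing sum with mean inside summarise(), and changing prefix to match, gives the per-group means instead.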

Answer 3

Just loop over the column combinations, compute the quantities you want, and then rbind them together:

library(data.table)
dt = as.data.table(dta) # or setDT to convert in place

cols = c('CatA', 'CatNum')

rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = TRUE)
#    CatA CatNum        V1
# 1:    A      1 1.2101839
# 2:    B      2 1.0329217
# 3:    C      1 1.2019674
# 4:    A      2 0.9699941
# 5:    B      1 0.4076565
# 6:    C      2 1.0749464
# 7:    A     NA 2.1801780
# 8:    B     NA 1.4405782
# 9:    C     NA 2.2769138
#10:   NA      1 2.8198078
#11:   NA      2 3.0778622
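
As a possible alternative to building the combinations by hand, data.table also provides groupingsets(), which takes the grouping sets as an explicit list. The sketch below assumes a data.table version that includes this function. The column name Sum is an illustrative choice.

```r
library(data.table)

set.seed(1)
dt <- data.table(CatA = rep(c("A", "B", "C"), 4),
                 CatNum = rep(1:2, 6),
                 SomeVal = runif(12))

# Each element of `sets` is one grouping: CatA alone, CatNum alone,
# and both columns together. Columns absent from a set are filled with NA.
res <- groupingsets(dt,
                    j = .(Sum = sum(SomeVal)),
                    by = c("CatA", "CatNum"),
                    sets = list("CatA", "CatNum", c("CatA", "CatNum")))
```

Listing the sets explicitly makes it easy to leave a grouping out or to add a grand total by including character(0) in sets.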
