Dplyr filtering based on two variables

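I want to use dplyr to determine which observations in a data frame satisfy the following condition: within each Group, the total of Var2 over the "good" rows is greater than the total over the "bad" rows.

Here is a toy data frame: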

library(dplyr)

set.seed(seed = 10)

df <- data.frame(Id = 1:12,
                 Group = rep(LETTERS[1:3], each = 4),
                 Var1 = sample(rep(c("good", "bad"), times = 1000), size = 12),
                 Var2 = sample(rep(1:10, times = 1000), size = 12))

print(df)

   Id Group Var1 Var2
1   1     A good    6
2   2     A  bad    9
3   3     A good   10
4   4     A good    7
5   5     B  bad    9
6   6     B  bad    1
7   7     B  bad    6
8   8     B good    6
9   9     C good    1
10 10     C  bad    8
11 11     C good    4
12 12     C  bad    2

So far I have worked out that I should use some combination of group_by(), summarise(), and filter(), but I can't seem to wrap my head around how to do it. Here is what I have so far:

keepers <- df %>% 
        group_by(Group, Var1) %>%
        summarise(Total = sum(Var2)) %>% 
        print()

Source: local data frame [6 x 3]
Groups: Group [?]

  Group  Var1 Total
  (chr) (chr) (int)
1     A   bad     9
2     A  good    23
3     B   bad    16
4     B  good     6
5     C   bad    10
6     C  good     5

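What should I do next? The final analysis should return "A", since it is the only Group in which the Total of the good observations is greater than that of the bad observations.

How about using spread and filter: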

> library(tidyr)
> df %>% group_by(Group, Var1) %>%
+    summarise(Total = sum(Var2)) %>%
+    spread(Var1,Total) %>%
+    filter(good>bad)
Source: local data frame [1 x 3]

  Group bad good
1     A   9   23
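
In more recent tidyr releases, spread() is superseded by pivot_wider(), so the same idea can be sketched as below. This is only an illustration under some assumptions: dplyr 1.0.0 or later (for the .groups argument) and tidyr 1.0.0 or later, with the data typed in literally from the printed table in the question rather than regenerated with sample(), so the result does not depend on the random number generator.

```r
library(dplyr)
library(tidyr)

# Data copied literally from the table printed in the question
df <- data.frame(
  Id    = 1:12,
  Group = rep(LETTERS[1:3], each = 4),
  Var1  = c("good", "bad", "good", "good", "bad", "bad",
            "bad", "good", "good", "bad", "good", "bad"),
  Var2  = c(6L, 9L, 10L, 7L, 9L, 1L, 6L, 6L, 1L, 8L, 4L, 2L)
)

keepers <- df %>%
  group_by(Group, Var1) %>%
  summarise(Total = sum(Var2), .groups = "drop") %>%    # one row per Group/Var1 pair
  pivot_wider(names_from = Var1, values_from = Total) %>% # columns: Group, bad, good
  filter(good > bad)                                      # keep groups where good wins

as.character(keepers$Group)
# [1] "A"
```

Dropping the grouping before reshaping keeps the result an ordinary tibble, which makes the final filter() a plain row-wise comparison of the two new columns.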

A similar option with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Group' and 'Var1', get the sum of 'Var2', reshape from 'long' to 'wide', and filter the rows where 'good' is greater than 'bad'.

library(data.table)
dcast(setDT(df)[, sum(Var2) , by = .(Group, Var1)], 
               Group~Var1, value.var='V1')[good>bad]
#   Group bad good
#1:     A   9   23
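
If the reshape step feels like a detour, the two totals can also be computed directly inside the grouped call. The following is a sketch of that variant, not the approach given above; the names dt and totals are made up for illustration, and the data is again typed in from the table printed in the question.

```r
library(data.table)

# Data copied literally from the table printed in the question
dt <- data.table(
  Group = rep(LETTERS[1:3], each = 4),
  Var1  = c("good", "bad", "good", "good", "bad", "bad",
            "bad", "good", "good", "bad", "good", "bad"),
  Var2  = c(6L, 9L, 10L, 7L, 9L, 1L, 6L, 6L, 1L, 8L, 4L, 2L)
)

# For each Group, sum Var2 separately over the "good" and the "bad" rows
totals <- dt[, .(good = sum(Var2[Var1 == "good"]),
                 bad  = sum(Var2[Var1 == "bad"])),
             by = Group]

# Keep only the groups where the "good" total exceeds the "bad" total
totals[good > bad]
```

One difference to keep in mind: here a group with no "bad" rows gets a "bad" total of 0 and can pass the filter, whereas with the reshape approach the missing combination becomes NA and the row is dropped by the comparison.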