dplyr: How to filter groups by subgroup criteria
My question is similar to this one, but the filter criteria are different.
> demo(dadmom,package="tidyr")
> library(tidyr)
> library(dplyr)
> dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
> dadmom %>%
+ gather(key, value, named:incm) %>%
+ separate(key, c("variable", "type"), -2) %>%
+ spread(variable, value, convert = TRUE)
  famid type   inc name
1     1    d 30000 Bill
2     1    m 15000 Bess
3     2    d 22000  Art
4     2    m 18000  Amy
5     3    d 25000 Paul
6     3    m 50000  Pat
Using "incm", it is easy to pick out the families where mom's income is > 20000 from the original table:
> dadmom
  famid named  incd namem  incm
1     1  Bill 30000  Bess 15000
2     2   Art 22000   Amy 18000
3     3  Paul 25000   Pat 50000
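For reference, the wide-table filter the question alludes to might look like the sketch below. The data frame is typed in by hand from the printout above (an assumption, so the snippet runs without the download), and `wide_result` is only an illustrative name.

```r
library(dplyr)

# Hand-typed copy of the wide table shown above (assumption: avoids the download).
dadmom <- data.frame(
  famid = 1:3,
  named = c("Bill", "Art", "Paul"),
  incd  = c(30000, 22000, 25000),
  namem = c("Bess", "Amy", "Pat"),
  incm  = c(15000, 18000, 50000)
)

# One row per family, so a plain column comparison is enough.
wide_result <- dadmom %>% filter(incm > 20000)
wide_result  # keeps only famid 3 (Pat, 50000)
```

Because the wide table has one row per family, the condition on incm applies to the family directly and no grouping is needed.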
The question is: how do I do this from the "tidied" data?
You can add group_by and filter to your code (the examples here use 15000 as the cutoff):
# OP's code: reshape the wide table to one row per person
d1 <- dadmom %>%
  gather(key, value, named:incm) %>%
  separate(key, c("variable", "type"), -2) %>%
  spread(variable, value, convert = TRUE)

# Keep a family only if every 'm' row has inc > 15000
d1 %>%
  group_by(famid) %>%
  filter(sum(type == 'm' & inc > 15000) == sum(type == 'm'))
#   famid type   inc name
# 1     2    d 22000  Art
# 2     2    m 18000  Amy
# 3     3    d 25000 Paul
# 4     3    m 50000  Pat
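Note: the approach above also works when a famid has more than one 'm' row, so it is a bit more general.
In the usual case of a single 'd'/'m' pair per famid, any() is enough: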
d1 %>%
  group_by(famid) %>%
  filter(any(inc > 15000 & type == 'm'))
#   famid type   inc name
# 1     2    d 22000  Art
# 2     2    m 18000  Amy
# 3     3    d 25000 Paul
# 4     3    m 50000  Pat
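To see where the two filters diverge, here is a small sketch on made-up data in which family 2 has two 'm' rows. The data frame `d2`, the extra person "Ann", and the result names `all_m` and `any_m` are hypothetical and not part of the original post.

```r
library(dplyr)

# Hypothetical data: family 2 has two 'm' rows, one above and one below 15000.
d2 <- data.frame(
  famid = c(1, 1, 2, 2, 2, 3, 3),
  type  = c("d", "m", "d", "m", "m", "d", "m"),
  inc   = c(30000, 15000, 22000, 18000, 12000, 25000, 50000),
  name  = c("Bill", "Bess", "Art", "Amy", "Ann", "Paul", "Pat"),
  stringsAsFactors = FALSE
)

# Strict version: every 'm' row in the family must exceed the cutoff.
all_m <- d2 %>%
  group_by(famid) %>%
  filter(sum(type == "m" & inc > 15000) == sum(type == "m")) %>%
  ungroup()

# Lenient version: at least one 'm' row must exceed the cutoff.
any_m <- d2 %>%
  group_by(famid) %>%
  filter(any(type == "m" & inc > 15000)) %>%
  ungroup()

unique(all_m$famid)  # 3
unique(any_m$famid)  # 2 3
```

One edge case worth keeping in mind: a family with no 'm' row at all passes the strict filter (0 == 0) but not the any() one, so the two are not interchangeable on incomplete families.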
In addition, if you would like to use data.table, melt from the devel version (v1.9.5) can take multiple value columns. It can be installed from here.
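library(data.table)
# Melt the name columns (2, 4) and the income columns (3, 5) together,
# relabel type 1/2 as 'd'/'m', then keep families with a qualifying 'm' row.
# Note: setDT() converts dadmom to a data.table by reference.
melt(setDT(dadmom), measure.vars = list(c(2, 4), c(3, 5)),
     variable.name = 'type', value.name = c('name', 'inc'))[
  , type := c('d', 'm')[type]][
  , .SD[any(type == 'm' & inc > 15000)], famid]
#    famid type name   inc
# 1:     2    d  Art 22000
# 2:     2    m  Amy 18000
# 3:     3    d Paul 25000
# 4:     3    m  Pat 50000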
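The chained data.table call is fairly dense, so here is the same idea unpacked into separate steps. This is only a sketch: it assumes a data.table version whose melt accepts a list in measure.vars (the post names the v1.9.5 devel version), it uses column names instead of positions, and the hand-typed table `wide` plus the names `long` and `res` are illustrative.

```r
library(data.table)

# Hand-typed copy of the wide table (assumption: matches the printout above).
wide <- data.table(
  famid = 1:3,
  named = c("Bill", "Art", "Paul"),
  incd  = c(30000, 22000, 25000),
  namem = c("Bess", "Amy", "Pat"),
  incm  = c(15000, 18000, 50000)
)

# Step 1: melt both column pairs at once; each list element becomes one value column.
long <- melt(wide,
             measure.vars  = list(c("named", "namem"), c("incd", "incm")),
             variable.name = "type",
             value.name    = c("name", "inc"))

# Step 2: `type` is a factor giving the position within each pair;
# indexing by it relabels 1 -> 'd' and 2 -> 'm'.
long[, type := c("d", "m")[type]]

# Step 3: per family, keep all rows if any 'm' row exceeds the cutoff.
res <- long[, .SD[any(type == "m" & inc > 15000)], by = famid]
res
```

Working on a separate copy (`wide`) rather than calling setDT() on dadmom leaves the original data frame untouched, which can be convenient when the dplyr code is run in the same session.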