Subset data frame to include only levels of one factor that have values in both levels of another factor

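I am working with a data frame of numeric measurements. Some individuals were measured more than once, both as juveniles and as adults. A reproducible example: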

ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each=5)
size <- rnorm(10)

# e.g. a1 is measured 3 times, twice as a juvenile, once as an adult.
d <- data.frame(ID, age, size)

My goal is to subset this data frame, keeping only the IDs that appear at least once as a juvenile and at least once as an adult. I'm not sure how to do that.

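The resulting data frame would contain all measurements for individuals a1, a2 and a3, but would exclude a4, a5 and a6, because they were not measured in both stages.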

A similar question was asked 7 months ago but never received an answer (Subset data frame to include only levels one factor that have values in both levels of another factor).

Thanks!

With dplyr, you can use group_by %>% filter, which keeps a whole ID group only if both age values occur in it:

library(dplyr)
d %>% group_by(ID) %>% filter(all(c("juvenile", "adult") %in% age))

# A tibble: 7 x 3
# Groups:   ID [3]
#      ID      age       size
#  <fctr>   <fctr>      <dbl>
#1     a1 juvenile -0.6947697
#2     a2 juvenile -0.3665272
#3     a3 juvenile  1.0293555
#4     a1 juvenile  0.2745224
#5     a2    adult  0.5299029
#6     a1    adult  2.2247802
#7     a3    adult -0.4717160

split the IDs by age, intersect the resulting sets, and subset:

d[d$ID %in% Reduce(intersect, split(d$ID, d$age)),]
#   ID      age        size
#1  a1 juvenile  1.44761836
#2  a2 juvenile  1.70098645
#3  a3 juvenile  0.08231986
#5  a1 juvenile  0.91240568
#6  a2    adult -1.77318962
#9  a1    adult  0.13597986
#10 a3    adult -1.18575294
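
To see why this one-liner works, it can help to look at the intermediate steps. This is only a sketch, assuming the same ID and age vectors as in the question. The size column is left out because it plays no role in the selection, stringsAsFactors = FALSE is set so the IDs are plain character vectors, and the names by_age, both and res are purely illustrative:

```r
# Same IDs and ages as in the question; size is omitted because it
# does not affect which IDs are kept.
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d   <- data.frame(ID, age, stringsAsFactors = FALSE)

# Step 1: split the IDs into one vector per age class.
# by_age$adult    holds a2, a5, a6, a1, a3
# by_age$juvenile holds a1, a2, a3, a4, a1
by_age <- split(d$ID, d$age)

# Step 2: fold intersect() over the list, keeping only the IDs
# that occur in every age class.
both <- Reduce(intersect, by_age)

# Step 3: keep every row whose ID is in that set.
res <- d[d$ID %in% both, ]
```

Because Reduce folds intersect over every element of the list, the same line should also cope with more than two age classes, in which case an ID is kept only if it shows up in all of them.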

Here is an option using data.table:

library(data.table)
setDT(d)[, .SD[all(c("juvenile", "adult") %in% age)], ID]

A base R option using ave:

d[with(d, ave(as.character(age), ID, FUN = function(x) length(unique(x)))>1),]
#   ID      age       size
#1  a1 juvenile -1.4545407
#2  a2 juvenile -0.4695317
#3  a3 juvenile  0.2271316
#5  a1 juvenile  0.2961210
#6  a2    adult -0.8331993
#9  a1    adult -0.6924967
#10 a3    adult -0.4619550
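
One thing to be aware of with the ave one-liner: since ave is handed a character vector, the per-ID counts appear to come back as strings, so the > 1 test is a string comparison. That gives the right answer here, but a variant that keeps the counts numeric may be easier to reason about. A minimal sketch, assuming the same ID and age vectors as in the question (size is omitted, and n_stages, keep and res are just illustrative names):

```r
# Same IDs and ages as in the question; size is omitted because it
# does not affect which rows are kept.
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d   <- data.frame(ID, age, stringsAsFactors = FALSE)

# Encode age as integer codes so that ave() returns integer counts:
# for each row, the number of distinct age classes seen for its ID.
n_stages <- ave(as.integer(factor(d$age)), d$ID,
                FUN = function(x) length(unique(x)))

# Keep the rows whose ID was seen in both age classes.
keep <- n_stages == 2
res  <- d[keep, ]
```

With more than two age classes, the 2 would need to be replaced by the number of classes, for example length(unique(d$age)).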