How to loop through columns and subset the dataset according to the same value

Question

I am trying to loop through columns and subset the data on rows where a column has the same value.

See below.

White <- rep(0:1, 50)
Latino <- rep(0:1, 50)
Black <- rep(0:1, 50)
Asian <- rep(0:1, 50)
DV <- seq(1: length(rep(0:1, 50)))
x <- data.frame(cbind(White, Latino, Black, Asian, DV))


race <- c("White", "Latino", "Black", "Asian")

for(j in race){
  for (i in race){

    df_1 <- subset(x, i == 1)
    df_2 <- subset(x, j == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV) )


  }
}

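Unfortunately, R does not like i or j on their own. If anyone knows a better way to loop through columns and subset on the same value, it would be greatly appreciated. Thanks.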

Answer 1

Note that i and j in your code are character strings, whereas what you actually want is the column each one names, e.g.

for(j in race){
  for (i in race){

    df_1 <- subset(x, x[,i] == 1)
    df_2 <- subset(x, x[,j] == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV) )


  }
}
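
To see why the original loop fails, it may help to check what the condition i == 1 evaluates to inside subset. The sketch below rebuilds the question's data (using seq_len(100) for DV, which gives the same values) and compares the two conditions. The names n_wrong and n_right are only illustrative.

```r
# Rebuild the data frame from the question
x <- data.frame(
  White  = rep(0:1, 50),
  Latino = rep(0:1, 50),
  Black  = rep(0:1, 50),
  Asian  = rep(0:1, 50),
  DV     = seq_len(100)
)

i <- "White"

# `i == 1` compares the string "White" with 1, which is a single FALSE,
# so subset() keeps no rows at all
n_wrong <- nrow(subset(x, i == 1))

# x[[i]] looks the column up by name, so the comparison is done per row
n_right <- nrow(subset(x, x[[i]] == 1))

print(c(wrong = n_wrong, right = n_right))
```

With this data the first count is 0 and the second is 50, because every even row holds a 1.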

As for a better way to loop: the dummy variables White, Latino, Black, and Asian appear to be mutually exclusive, so perhaps we can rearrange the data into

      race  DV
   ------------
1    Black   1
2    White   2
3   Latino   3
4    Black   4
5    Asian   5

and call t.test with a formula, e.g.

# generate synthetic data
# (the 50 sampled values are recycled by data.frame() to fill all 100 rows)
rnd.race <- sample(1:4, 50, replace = TRUE)
x <- data.frame(
  White = as.integer(rnd.race == 1),
  Latino = as.integer(rnd.race == 2),
  Black = as.integer(rnd.race == 3),
  Asian = as.integer(rnd.race == 4),
  DV = seq_len(100)
)

race <- c("White", "Latino", "Black", "Asian")

# rearrange data, gather columns of dummy variables
x.cleaned = data.frame(
  race = race[apply(x[,1:4], 1, which.max)],
  DV = x$DV
)

t.test(DV ~ race, data = x.cleaned, subset = race %in% c("White", "Black"))

# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -25.241536   9.483961
# sample estimates:
# mean in group Black mean in group White 
#            47.66667            55.54545 
#

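The biggest benefit of using t.test with a formula is readability. For example, instead of mean of x and mean of y, the t.test report says mean in group Black and mean in group White, and the formula itself states which variables the test relates.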
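
The difference in labelling can be checked directly on the returned object. This is a minimal sketch on made-up data: the data frame d, the seed, and the group sizes are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
d <- data.frame(
  race = rep(c("Black", "White"), each = 20),
  DV   = rnorm(40)
)

# Formula interface: estimates are labelled with the group names
by_formula <- t.test(DV ~ race, data = d)
print(names(by_formula$estimate))

# Default interface: estimates are labelled generically
by_vectors <- t.test(d$DV[d$race == "Black"], d$DV[d$race == "White"])
print(names(by_vectors$estimate))

# Both calls run the same Welch test on the same two samples
same_t <- all.equal(unname(by_formula$statistic), unname(by_vectors$statistic))
print(same_t)
```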

To run the t-test over all pairs, we can

run.test = function(race.pair) {
    list(t.test(DV ~ race, data=x.cleaned, race %in% race.pair) )
}

combn(race, 2, FUN = run.test)

# [[1]]
# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.30892, df = 41.997, p-value = 0.7589
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -21.22870  15.59233
# sample estimates:
# mean in group Latino  mean in group White 
#             52.72727             55.54545 
# 
# 
# [[2]]
# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -25.241536   9.483961
# sample estimates:
# mean in group Black mean in group White 
#            47.66667            55.54545 
# 
# ...

where combn(x, m, FUN = NULL, simplify = TRUE, ...) is a built-in function that generates all combinations of the elements of x taken m at a time. For generating more general cases, see outer.
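
If only the p-values are needed, the pairs produced by combn can be collected into a small table. The following is a sketch under assumptions: x.cleaned is regenerated here with an arbitrary seed, and the names race.pairs, p.values, and result are illustrative rather than part of the approach above.

```r
set.seed(42)  # arbitrary seed
race <- c("White", "Latino", "Black", "Asian")
x.cleaned <- data.frame(
  race = sample(race, 100, replace = TRUE),
  DV   = seq_len(100)
)

# combn(race, 2) returns a 2 x 6 matrix; each column is one pair
race.pairs <- combn(race, 2)

# Run one Welch t-test per pair and keep only the p-value
p.values <- apply(race.pairs, 2, function(race.pair) {
  t.test(DV ~ race, data = x.cleaned, subset = race %in% race.pair)$p.value
})

result <- data.frame(
  group1  = race.pairs[1, ],
  group2  = race.pairs[2, ],
  p.value = p.values
)
print(result)
```

When many pairs are tested this way, p.adjust(result$p.value, method = "holm") is one option for correcting for multiple comparisons.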

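Finally, IMHO, when comparing means among three or more groups, ANOVA is probably more widely accepted than the t-test (which may also hint at why it is "inconvenient" to apply the t-test iteratively to pairs of groups).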

With x.cleaned, we can easily run an ANOVA in R, e.g.:

aov.out = aov(DV ~ race, data=x.cleaned)
summary(aov.out)

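Note that after a one-way ANOVA (which tests whether the means of some groups differ), we can also run a post hoc test (such as TukeyHSD(aov.out)) to find out which specific pairs of groups have different means. In a formal report, some assumption tests are also de rigueur. Here are some lecture notes related to this. And this is a related question on Cross Validated (which can answer more questions about which test to choose).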
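
The ANOVA followed by the post hoc test might look like the sketch below. It regenerates x.cleaned with an arbitrary seed and stores race as an explicit factor. That conversion is an assumption added here: since R 4.0, data.frame() no longer turns strings into factors by default, and TukeyHSD works on factor terms, so converting explicitly is the safer choice.

```r
set.seed(42)  # arbitrary seed
race <- c("White", "Latino", "Black", "Asian")
x.cleaned <- data.frame(
  race = factor(sample(race, 100, replace = TRUE)),  # explicit factor
  DV   = seq_len(100)
)

# One-way ANOVA: do the group means differ at all?
aov.out <- aov(DV ~ race, data = x.cleaned)
print(summary(aov.out))

# Post hoc test: which pairs of groups differ?
tukey <- TukeyHSD(aov.out)
print(tukey)

# 4 groups give choose(4, 2) = 6 pairwise comparisons
print(nrow(tukey$race))
```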

Answer 2

You may need to add get

for(j in race){
  for (i in race){

    df_1 <- subset(x, get(i) == 1)
    df_2 <- subset(x, get(j) == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV) )


  }
}

Answer 3

In R, we can also do this with outer

f1 <- function(u, v) list(t.test(x$DV[x[[u]] ==1], x$DV[x[[v]] == 1]))
out <- outer(race, race, FUN = Vectorize(f1))
out[1,1]
#[[1]]

#   Welch Two Sample t-test

#data:  x$DV[x[[u]] == 1] and x$DV[x[[v]] == 1]
#t = 0, df = 98, p-value = 1
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -11.57133  11.57133
#sample estimates:
#mean of x mean of y 
#       51        51

The result can be turned into a named list

lst1 <- setNames(lapply(out, I), outer(race, race, FUN = paste))
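
When only the p-values matter, the same outer pattern can return a plain numeric matrix. This is a sketch that rebuilds the question's data; the helper p1 and the name p.mat are illustrative.

```r
race <- c("White", "Latino", "Black", "Asian")
x <- data.frame(
  White = rep(0:1, 50), Latino = rep(0:1, 50),
  Black = rep(0:1, 50), Asian  = rep(0:1, 50),
  DV    = seq_len(100)
)

# Return only the p-value, so each call yields a single number
p1 <- function(u, v) t.test(x$DV[x[[u]] == 1], x$DV[x[[v]] == 1])$p.value

# Vectorize() is needed because outer() passes whole vectors of names
p.mat <- outer(race, race, FUN = Vectorize(p1))
dimnames(p.mat) <- list(race, race)
print(p.mat)
```

Because all four dummy columns in the question's data are identical, every comparison is between the same two samples and every entry of the matrix is 1.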
