How to loop through columns and subset the dataset according to the same value
I am trying to loop through columns and subset the data on the same value in each column.
See below.
White <- rep(0:1, 50)
Latino <- rep(0:1, 50)
Black <- rep(0:1, 50)
Asian <- rep(0:1, 50)
DV <- seq(1: length(rep(0:1, 50)))
x <- data.frame(cbind(White, Latino, Black, Asian, DV))
race <- c("White", "Latino", "Black", "Asian")
for(j in race){
for (i in race){
df_1 <- subset(x, i == 1)
df_2 <- subset(x, j == 1)
print(paste(i, j, sep = " "))
print(t.test(df_1$DV, df_2$DV) )
}
}
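Unfortunately, R doesn't like i or j on their own. If anyone knows a better way to loop through columns and subset on the same value, it would be much appreciated. Thanks.
Note that i and j in your code are character strings, whereas what you actually want is to extract the column with that name, for example: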
for (j in race) {
  for (i in race) {
    # index the column by its name instead of comparing the name itself
    df_1 <- subset(x, x[, i] == 1)
    df_2 <- subset(x, x[, j] == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV))
  }
}
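To see why the original loop fails, here is a minimal sketch on a shortened version of the question's data. The names n_string and n_column are introduced here purely for illustration. Inside subset, the condition i == 1 compares the string itself with 1, which is FALSE, so no rows are kept.

```r
# Shortened version of the question's data (10 rows instead of 100)
x <- data.frame(White = rep(0:1, 5), Black = rep(0:1, 5), DV = 1:10)
i <- "White"

# The condition compares the string "White" with 1, which is FALSE,
# so subset() keeps no rows at all
n_string <- nrow(subset(x, i == 1))

# Indexing by name compares the values stored in that column instead
n_column <- nrow(subset(x, x[[i]] == 1))

print(c(string = n_string, column = n_column))
```

With empty subsets, t.test has no observations to work with, which is why the loop in the question stops with an error.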
As for a better way to loop: the dummy variables White, Latino, Black and Asian appear to be mutually exclusive, so perhaps we can rearrange the data into
race DV
------------
1 Black 1
2 White 2
3 Latino 3
4 Black 4
5 Asian 5
and call t.test with a formula, for example:
# generate synthetic data: each row belongs to exactly one race
n <- 100
rnd.race <- sample(1:4, n, replace = TRUE)
x <- data.frame(
  White = as.integer(rnd.race == 1),
  Latino = as.integer(rnd.race == 2),
  Black = as.integer(rnd.race == 3),
  Asian = as.integer(rnd.race == 4),
  DV = seq_len(n)
)
race <- c("White", "Latino", "Black", "Asian")
# rearrange data: gather the dummy columns into a single race column
x.cleaned <- data.frame(
  race = race[apply(x[, 1:4], 1, which.max)],
  DV = x$DV
)
# compare two groups only, selected through the subset argument
t.test(DV ~ race, data = x.cleaned, subset = race %in% c("White", "Black"))
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -25.241536 9.483961
# sample estimates:
# mean in group Black mean in group White
# 47.66667 55.54545
#
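As a variation on the rearranging step, the following sketch collapses the dummy columns with max.col and first checks that exactly one dummy is set per row. The tiny hand-made data frame is an assumption made for illustration, not the data from the answer.

```r
race <- c("White", "Latino", "Black", "Asian")

# Small hand-made example: exactly one dummy is 1 in each row
x <- data.frame(
  White  = c(1, 0, 0, 0, 1),
  Latino = c(0, 1, 0, 0, 0),
  Black  = c(0, 0, 1, 0, 0),
  Asian  = c(0, 0, 0, 1, 0),
  DV     = c(10, 20, 30, 40, 50)
)

# Verify the mutual-exclusivity assumption before collapsing the columns
stopifnot(all(rowSums(x[race]) == 1))

# max.col() returns, for each row, the index of the column holding the maximum
idx <- max.col(as.matrix(x[race]), ties.method = "first")
x.cleaned <- data.frame(
  race = factor(race[idx], levels = race),
  DV   = x$DV
)
print(x.cleaned)
```

If the check fails, the dummies are not mutually exclusive and a single race column cannot represent the data without losing information.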
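The biggest benefit of using t.test with a formula is readability. For example, instead of mean of x and mean of y, the t.test report says mean in group Black and mean in group White, and the formula itself states which variables we are testing for a relationship.
To run the t-test over all pairs, we can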
run.test = function(race.pair) {
list(t.test(DV ~ race, data=x.cleaned, race %in% race.pair) )
}
combn(race, 2, FUN = run.test)
# [[1]]
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.30892, df = 41.997, p-value = 0.7589
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -21.22870 15.59233
# sample estimates:
# mean in group Latino mean in group White
# 52.72727 55.54545
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -25.241536 9.483961
# sample estimates:
# mean in group Black mean in group White
# 47.66667 55.54545
#
# ...
where combn(x, m, FUN = NULL, simplify = TRUE, ...) is a built-in function that generates all combinations of the elements of x taken m at a time. Other sets of cases can be generated with outer.
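If the full printed reports are more than is needed, the pairwise results can be collected into a table. The sketch below is one possible way to do that. The synthetic data, the seed, and the names pairs and results are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
race <- c("White", "Latino", "Black", "Asian")
x.cleaned <- data.frame(
  race = sample(race, 100, replace = TRUE),
  DV   = rnorm(100)
)

# simplify = FALSE returns a plain list with one character pair per element
pairs <- combn(race, 2, simplify = FALSE)

# Run one Welch t-test per pair and keep only the summary numbers
results <- do.call(rbind, lapply(pairs, function(p) {
  sub <- x.cleaned[x.cleaned$race %in% p, ]
  tt <- t.test(DV ~ race, data = sub)
  data.frame(group1 = p[1], group2 = p[2],
             t = unname(tt$statistic), p.value = tt$p.value)
}))
print(results)
```

With four groups this gives choose(4, 2) = 6 rows, one per unordered pair.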
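Finally, IMHO, when comparing means across three or more groups, ANOVA is probably more widely accepted than the t-test (which may also hint at why it is "inconvenient" to apply t-tests iteratively to pairs of groups).
With x.cleaned, we can easily run an ANOVA in R, like so: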
aov.out = aov(DV ~ race, data=x.cleaned)
summary(aov.out)
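Note that after a one-way ANOVA (which tests whether the means of some groups differ), we can also run a post hoc test (such as TukeyHSD(aov.out)) to find out which specific pairs of groups have different means. In a formal report, some tests of the assumptions are also de rigueur. Here are some lecture notes related to this. And this is a related question on Cross Validated (which may answer more questions about which test to choose).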
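The post hoc and assumption checks mentioned above might look like the sketch below. The synthetic data and seed are assumptions made for illustration. Note that race is stored as a factor here so that TukeyHSD can use it, and that the Bartlett and Shapiro-Wilk tests are only examples of assumption tests, not necessarily the ones the author had in mind.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
race <- c("White", "Latino", "Black", "Asian")
x.cleaned <- data.frame(
  race = factor(sample(race, 100, replace = TRUE), levels = race),
  DV   = rnorm(100)
)

aov.out <- aov(DV ~ race, data = x.cleaned)

# Post hoc comparison of every pair of groups
tukey <- TukeyHSD(aov.out)
print(tukey$race)  # one row per pair: diff, lwr, upr, p adj

# Pairwise t-tests with a multiple-comparison correction (Holm)
pw <- pairwise.t.test(x.cleaned$DV, x.cleaned$race, p.adjust.method = "holm")
print(pw$p.value)

# Example assumption checks: equal variances and normal residuals
print(bartlett.test(DV ~ race, data = x.cleaned))
print(shapiro.test(residuals(aov.out)))
```

Both TukeyHSD and pairwise.t.test adjust for the number of comparisons, which the plain loop over t.test does not.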
You may need to add get:
for (j in race) {
  for (i in race) {
    # get() looks up the column whose name is stored in i or j
    df_1 <- subset(x, get(i) == 1)
    df_2 <- subset(x, get(j) == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV))
  }
}
In R, we can also do this with outer:
f1 <- function(u, v) list(t.test(x$DV[x[[u]] ==1], x$DV[x[[v]] == 1]))
out <- outer(race, race, FUN = Vectorize(f1))
out[1,1]
#[[1]]
# Welch Two Sample t-test
#data: x$DV[x[[u]] == 1] and x$DV[x[[v]] == 1]
#t = 0, df = 98, p-value = 1
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -11.57133 11.57133
#sample estimates:
#mean of x mean of y
# 51 51
The output can be turned into a named list:
lst1 <- setNames(lapply(out, I), outer(race, race, FUN = paste))
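Once the results sit in a named list, individual tests can be looked up by name. The sketch below rebuilds the question's data and pulls out the p-values. The name p.values is introduced here for illustration, and as.vector is added only to pass a plain character vector of names.

```r
# Data as in the question: all four dummy columns are identical
White <- rep(0:1, 50)
x <- data.frame(White, Latino = White, Black = White, Asian = White, DV = 1:100)
race <- c("White", "Latino", "Black", "Asian")

f1 <- function(u, v) list(t.test(x$DV[x[[u]] == 1], x$DV[x[[v]] == 1]))
out <- outer(race, race, FUN = Vectorize(f1))

# Names follow the same column-major order as the elements of out
lst1 <- setNames(lapply(out, I), as.vector(outer(race, race, FUN = paste)))

# Extract one number per test and look a pair up by name
p.values <- sapply(lst1, function(tt) tt$p.value)
print(p.values[["White Black"]])
```

Because every dummy column in the question's data is the same, both samples are identical in each test, which is why the report above shows t = 0 and a p-value of 1.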