如何对由多个分组变量分隔的个人进行的几个变量进行方差分析?
How to conduct an ANOVA of several variables taken on individuals separated by multiple grouping variables?
我有一个类似于下面代码创建的数据框。在这个例子中,5 个变量的测量是对 30 个由 ID
表示的个体进行的。个体可以被三个分组变量中的任何一个分开:GroupVar1,GroupVar2,GroupVar3
。对于每个分组变量,我需要对 5 个变量中的每一个进行方差分析,并且 return 每个变量的结果(可能是 pdf 或单独的文档?)。我如何编写函数或使用迭代来处理此问题并最大程度地减少代码中的重复?如果你有一个大数据集(我的真实数据集有几百个人,分组变量的大小范围从 6 到 30 组),提取和可视化结果的最佳方法是什么?
library(tidyverse)
GroupVar1 <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
GroupVar2 <- rep(c("alpha", "beta", "gamma"), each = 10)
GroupVar3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
Var1 <- rnorm(30)
Var2 <- rnorm(30)
Var3 <- rnorm(30)
Var4 <- rnorm(30)
Var5 <- rnorm(30)
data <- tibble(GroupVar1,GroupVar2,GroupVar3,ID,Var1,Var2,Var3,Var4,Var5)
> dput(data[1:10,])
structure(list(Location = structure(c(21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L), .Label = c("ALTE", "ASTR", "BREA",
"CAMN", "CFU", "COEN", "JENT", "NAT", "NEAU", "NOCO", "OOGG",
"OPMM", "PING", "PITC", "POMO", "REAN", "ROND", "RTD", "SANT",
"SMIT", "SUN", "TEAR", "WINC"), class = "factor"), PR = structure(c(16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("ALTE",
"ASTR", "CF", "CHOW", "JENT", "NAT", "NEAU", "NSE", "OOGG", "PALM",
"POMO", "REAN", "ROND", "RTD", "SS", "SUN", "WINC"), class = "factor"),
Est = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AS",
"CB", "CF", "CS", "OS", "PS", "SS", "WB"), class = "factor"),
State = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("FL", "GA", "MD", "NC", "SC", "VA"), class = "factor"),
Year = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L), ID = c(90L, 92L, 93L, 95L, 96L, 98L,
99L, 100L, 103L, 109L), Sex = structure(c(1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 2L), .Label = c("F", "M"), class = "factor"),
DOB = c(-0.674706816, 2.10472846, 0.279952847, -0.26959379,
-1.243977657, 0.188828771, 0.026530709, 0.483363306, -0.63599302,
-0.979506001), Mg = c(-1.409815618, 1.180920604, 0.765102543,
1.828057339, -0.689841498, -0.604272366, 0.194867939, -1.015964127,
-0.520136693, 0.769042585), Mn7 = c(1.387385913, 0.320582444,
-0.490356598, -0.020540649, -0.594210249, -1.119170306, -0.225065868,
-1.892064456, -2.434101506, -0.816518662), Cu7 = c(-0.176599651,
0.100529267, 1.4967142, 0.094840221, 1.791653259, -0.191723817,
-1.526868086, -0.308696916, -2.046613977, -2.228513411),
Zn7 = c(-0.338454617, -0.235800727, -0.785876374, 0.114698826,
0.202960987, 0.432013987, 0.164099621, 0.609232311, 0.169329098,
-0.284402654), Sr7 = c(-0.010929071, -1.616835312, -0.208856,
-0.362538736, 1.662066318, -0.893155185, 0.699406559, -0.333176495,
-2.026364633, -1.324456127), Ba7 = c(-1.041126455, 0.551165907,
0.126849272, -1.069762666, -0.922501551, -1.36095076, 1.57800858,
-0.842518997, -1.017894235, 0.265895019)), row.names = c(NA,
10L), class = "data.frame")
在不太了解基础数据的情况下,我的直觉是这可能是对方差分析的不当使用。我建议您 post 到 Cross Validated 以确认您没有违反这里的任何假设。
无论如何,这是我用来解决所提出问题的代码:
# We will use dplyr, tidyr, purrr, stats, and broom to accomplish this
# I am using tidyr v1.0.0. For older versions you will need to modify code for pivot_longer
results <- data %>%
# First pivot the data longer so each dependent variable is on its own row
pivot_longer(
cols = Var1:Var5,
names_to = "name",
values_to = "value"
) %>%
# Second, pivot longer again, so each row is now its unique grouping var
pivot_longer(
cols = GroupVar1:GroupVar3,
names_to = "group_name",
values_to = "group_value"
) %>%
# group by both group name and dependent variable
group_by(name, group_name) %>%
# nest the data, so each dataset is unique for each dependent and independent variable
nest() %>%
mutate(
# run an anova on each nested data frame
anova = map(data, ~aov(data = .x, value ~ group_value)), # may need to change aov() call here
# use broom to tidy the output
tidied_results = map(anova, broom::tidy)
)
# To easily access the ANOVA results, you can do something like the following:
results %>%
# select columns of interest
select(name, group_name, tidied_results) %>%
# unnest to access summary information of ANOVA
unnest(cols = c(tidied_results))
我认为您还需要使用某种多重比较校正,例如 Bonferroni 校正。同样,Cross Validated 可以在这里引导您朝着正确的方向前进。
根据带有输入数据的更新问题编辑答案:
假设代表分组变量的列是 1:5 和 7,并且假设因数值变量在列 8:14 中,这可以使用双循环完成,没有其他依赖关系:
tests <- list()
Groups <- c(1:5, 7)
Variables <- 8:14
for(i in Groups)
{
Group <- as.factor(data[[i]])
for(j in Variables)
{
test_name <- paste0(names(data)[j], "_by_", names(data[i]))
Response <- data[[j]]
tests[[test_name]] <- anova(lm(Response ~ Group))
}
}
现在您可以使用 lapply
对所有这些测试进行您喜欢的操作,例如
lapply(tests, print)
我同意@DaveGruenewald 关于多重假设检验的观点 - 事实上,这个例子很好地说明了为什么需要 Buonferroni 或 Sidak 的修正,因为(正如预期的那样)有几个 "significant" p 值在随机数据中,仅仅是因为涉及的测试数量。
我有一个类似于下面代码创建的数据框。在这个例子中,5 个变量的测量是对 30 个由 ID
表示的个体进行的。个体可以被三个分组变量中的任何一个分开:GroupVar1,GroupVar2,GroupVar3
。对于每个分组变量,我需要对 5 个变量中的每一个进行方差分析,并且 return 每个变量的结果(可能是 pdf 或单独的文档?)。我如何编写函数或使用迭代来处理此问题并最大程度地减少代码中的重复?如果你有一个大数据集(我的真实数据集有几百个人,分组变量的大小范围从 6 到 30 组),提取和可视化结果的最佳方法是什么?
library(tidyverse)
GroupVar1 <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
GroupVar2 <- rep(c("alpha", "beta", "gamma"), each = 10)
GroupVar3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
Var1 <- rnorm(30)
Var2 <- rnorm(30)
Var3 <- rnorm(30)
Var4 <- rnorm(30)
Var5 <- rnorm(30)
data <- tibble(GroupVar1,GroupVar2,GroupVar3,ID,Var1,Var2,Var3,Var4,Var5)
> dput(data[1:10,])
structure(list(Location = structure(c(21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L), .Label = c("ALTE", "ASTR", "BREA",
"CAMN", "CFU", "COEN", "JENT", "NAT", "NEAU", "NOCO", "OOGG",
"OPMM", "PING", "PITC", "POMO", "REAN", "ROND", "RTD", "SANT",
"SMIT", "SUN", "TEAR", "WINC"), class = "factor"), PR = structure(c(16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("ALTE",
"ASTR", "CF", "CHOW", "JENT", "NAT", "NEAU", "NSE", "OOGG", "PALM",
"POMO", "REAN", "ROND", "RTD", "SS", "SUN", "WINC"), class = "factor"),
Est = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AS",
"CB", "CF", "CS", "OS", "PS", "SS", "WB"), class = "factor"),
State = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("FL", "GA", "MD", "NC", "SC", "VA"), class = "factor"),
Year = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L), ID = c(90L, 92L, 93L, 95L, 96L, 98L,
99L, 100L, 103L, 109L), Sex = structure(c(1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 2L), .Label = c("F", "M"), class = "factor"),
DOB = c(-0.674706816, 2.10472846, 0.279952847, -0.26959379,
-1.243977657, 0.188828771, 0.026530709, 0.483363306, -0.63599302,
-0.979506001), Mg = c(-1.409815618, 1.180920604, 0.765102543,
1.828057339, -0.689841498, -0.604272366, 0.194867939, -1.015964127,
-0.520136693, 0.769042585), Mn7 = c(1.387385913, 0.320582444,
-0.490356598, -0.020540649, -0.594210249, -1.119170306, -0.225065868,
-1.892064456, -2.434101506, -0.816518662), Cu7 = c(-0.176599651,
0.100529267, 1.4967142, 0.094840221, 1.791653259, -0.191723817,
-1.526868086, -0.308696916, -2.046613977, -2.228513411),
Zn7 = c(-0.338454617, -0.235800727, -0.785876374, 0.114698826,
0.202960987, 0.432013987, 0.164099621, 0.609232311, 0.169329098,
-0.284402654), Sr7 = c(-0.010929071, -1.616835312, -0.208856,
-0.362538736, 1.662066318, -0.893155185, 0.699406559, -0.333176495,
-2.026364633, -1.324456127), Ba7 = c(-1.041126455, 0.551165907,
0.126849272, -1.069762666, -0.922501551, -1.36095076, 1.57800858,
-0.842518997, -1.017894235, 0.265895019)), row.names = c(NA,
10L), class = "data.frame")
在不太了解基础数据的情况下,我的直觉是这可能是对方差分析的不当使用。我建议您 post 到 Cross Validated 以确认您没有违反这里的任何假设。
无论如何,这是我用来解决所提出问题的代码:
# We will use dplyr, tidyr, purrr, stats, and broom to accomplish this
# I am using tidyr v1.0.0. For older versions you will need to modify code for pivot_longer
results <- data %>%
# First pivot the data longer so each dependent variable is on its own row
pivot_longer(
cols = Var1:Var5,
names_to = "name",
values_to = "value"
) %>%
# Second, pivot longer again, so each row is now its unique grouping var
pivot_longer(
cols = GroupVar1:GroupVar3,
names_to = "group_name",
values_to = "group_value"
) %>%
# group by both group name and dependent variable
group_by(name, group_name) %>%
# nest the data, so each dataset is unique for each dependent and independent variable
nest() %>%
mutate(
# run an anova on each nested data frame
anova = map(data, ~aov(data = .x, value ~ group_value)), # may need to change aov() call here
# use broom to tidy the output
tidied_results = map(anova, broom::tidy)
)
# To easily access the ANOVA results, you can do something like the following:
results %>%
# select columns of interest
select(name, group_name, tidied_results) %>%
# unnest to access summary information of ANOVA
unnest(cols = c(tidied_results))
我认为您还需要使用某种多重比较校正,例如 Bonferroni 校正。同样,Cross Validated 可以在这里引导您朝着正确的方向前进。
根据带有输入数据的更新问题编辑答案:
假设代表分组变量的列是 1:5 和 7,并且假设因数值变量在列 8:14 中,这可以使用双循环完成,没有其他依赖关系:
tests <- list()
Groups <- c(1:5, 7)
Variables <- 8:14
for(i in Groups)
{
Group <- as.factor(data[[i]])
for(j in Variables)
{
test_name <- paste0(names(data)[j], "_by_", names(data[i]))
Response <- data[[j]]
tests[[test_name]] <- anova(lm(Response ~ Group))
}
}
现在您可以使用 lapply
对所有这些测试进行您喜欢的操作,例如
lapply(tests, print)
我同意@DaveGruenewald 关于多重假设检验的观点 - 事实上,这个例子很好地说明了为什么需要 Buonferroni 或 Sidak 的修正,因为(正如预期的那样)有几个 "significant" p 值在随机数据中,仅仅是因为涉及的测试数量。