Running ANOVA on multiple column conditions
This is what a subset of my data frame looks like.
a <- dput(head(mrna.pcs))
structure(list(Mouse.ID = c("DO.0661", "DO.0669", "DO.0670",
"DO.0673", "DO.0674", "DO.0676"), Sex = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), fAge = structure(c(2L,
3L, 2L, 3L, 2L, 2L), .Label = c("6", "12", "18"), class = "factor"),
Index = structure(c(21L, 24L, 11L, 20L, 12L, 19L), .Label = c("AR001",
"AR002", "AR003", "AR004", "AR005", "AR006", "AR007", "AR008",
"AR009", "AR010", "AR011", "AR012", "AR013", "AR014", "AR015",
"AR016", "AR018", "AR019", "AR020", "AR021", "AR022", "AR023",
"AR025", "AR027"), class = "factor"), Lane = structure(c(6L,
2L, 4L, 5L, 5L, 4L), .Label = c("1", "2", "3", "4", "5",
"6", "7", "8"), class = "factor"), Gen = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("8", "9", "10", "11", "12"
), class = "factor"), PC1 = c(-23.147618298858, -23.004329868562,
-17.0024755772689, -23.9178589007844, -56.7766982399411,
-34.3969872418573), PC2 = c(40.5243564641241, 2.99206119995141,
-61.4176842149059, 7.10965422446634, 7.28461966315024, -64.1955797075099
), PC3 = c(-17.0598627155672, -22.1038475592448, -6.25238299099893,
23.500307567532, 53.4553992426852, -20.1077749520339), PC4 = c(-5.37605681469604,
28.8757760174757, 1.96723351126677, 10.1757811517044, 7.63553142427313,
-0.61083387825962), PC5 = c(2.49156058897602, -2.2801673669604,
-5.45494631567109, -5.44682692111089, -7.21616736676726,
-11.0786655194642), PC6 = c(-11.625850369587, 1.54093546690149,
-4.87370378395642, -22.0735137415442, -2.44337914021456,
0.619440592140127), PC7 = c(7.20873385839409, -17.719801994905,
-0.811301497692041, 7.55418040146638, -4.68437054723712,
1.1158744957288), PC8 = c(-7.19678837565302, 6.24827779166403,
0.224651092284126, 6.10960416152842, -14.6615234719377, -0.410198021192528
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Data frame
Mouse.ID Sex fAge Index Lane Gen PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
<chr> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 DO.0661 F 12 AR022 6 8 -23.1 40.5 -17.1 -5.38 2.49 -11.6 7.21 -7.20
2 DO.0669 F 18 AR027 2 8 -23.0 2.99 -22.1 28.9 -2.28 1.54 -17.7 6.25
3 DO.0670 F 12 AR011 4 8 -17.0 -61.4 -6.25 1.97 -5.45 -4.87 -0.811 0.225
4 DO.0673 F 18 AR021 5 8 -23.9 7.11 23.5 10.2 -5.45 -22.1 7.55 6.11
5 DO.0674 F 12 AR012 5 8 -56.8 7.28 53.5 7.64 -7.22 -2.44 -4.68 -14.7
6 DO.0676 F 12 AR020 4 8 -34.4 -64.2 -20.1 -0.611 -11.1 0.619 1.12 -0.410
My objective is to run an ANOVA between my principal components (PCs) and these variables: Sex, fAge, Index, Lane, Gen.
This is how I currently run it.
For PC1
anova(lm(PC1 ~ Sex*fAge, data=mrna.pcs))
For PC2
anova(lm(PC2 ~ Sex*fAge, data=mrna.pcs))
For PC3
anova(lm(PC3 ~ Sex*fAge, data=mrna.pcs))
Similarly for the other PCs
anova(lm(PC4 ~ Sex*fAge, data=mrna.pcs))
anova(lm(PC5 ~ Sex*fAge, data=mrna.pcs))
anova(lm(PC6 ~ Sex*fAge, data=mrna.pcs))
anova(lm(PC7 ~ Sex*fAge, data=mrna.pcs))
anova(lm(PC8 ~ Sex*fAge, data=mrna.pcs))
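These calls only cover Sex and fAge; to test the remaining predictors I would have to run each of them separately. That is fine here because I only have a few PCs, but I have data that runs to a considerable number of PCs and other traits/predictors.
So my question is how to set these up to run in one go, so that each PC is tested against all of the predictors.
For example: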
PC1 ~ Sex
PC1 ~ Sex+fAge
PC1 ~ Sex+fAge+Index
PC1 ~ Sex+fAge+Index+Lane
PC1 ~ Sex+fAge+Index+Lane+Gen
And the same for the other PCs.
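As Axeman pointed out, blindly trying every possible permutation of regressions is a bad idea. This kind of "fishing expedition" approach is very likely to produce spurious results.
That said, you can generate a large number of formulas in the following way and then apply them to your dataset. Since your example dataset only contains 6 rows, there isn't really enough data to run the last step, but it should work. Here, I use expand.grid to generate the different formulas, then use lapply to run them against the data.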
rhs <- c(
'Sex',
'Sex+fAge',
'Sex+fAge+Index',
'Sex+fAge+Index+Lane',
'Sex+fAge+Index+Lane+Gen'
)
dv <- paste0('PC', 1:8)
frms <- with(expand.grid(dv, rhs), paste(Var1, Var2, sep = ' ~ '))
models <- lapply(frms, function(x) anova(lm(x, data = mrna.pcs)))
names(models) <- frms # so that you can see which formula belongs to which output
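Because six rows are too few to fit these models, here is a minimal sketch of the same expand.grid / lapply pattern on simulated data. The data frame `toy` and its values are hypothetical stand-ins for mrna.pcs, and the `pvals` step is just one possible way to pull the results together.

```r
# Simulated stand-in for mrna.pcs (hypothetical data, not the asker's)
set.seed(1)
toy <- data.frame(
  Sex  = factor(sample(c("F", "M"), 60, replace = TRUE)),
  fAge = factor(sample(c("6", "12", "18"), 60, replace = TRUE)),
  PC1  = rnorm(60),
  PC2  = rnorm(60)
)

rhs  <- c("Sex", "Sex+fAge")
dv   <- c("PC1", "PC2")

# Every response paired with every right-hand side
frms <- with(expand.grid(dv, rhs), paste(Var1, Var2, sep = " ~ "))

# Fit each model and keep its anova table, named by formula string
models <- lapply(frms, function(x) anova(lm(as.formula(x), data = toy)))
names(models) <- frms

# Look up one table by its formula string
models[["PC1 ~ Sex+fAge"]]

# Collect the p-value column of every table, labelled by term
pvals <- lapply(models, function(tab) setNames(tab[["Pr(>F)"]], rownames(tab)))
```

Naming the list by formula string is what makes the lookup in the second-to-last step possible; the Residuals row of each table has no p-value, so it shows up as NA in `pvals`.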
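Alternatively, instead of using a predefined set of formulas, you could use combn to generate all possible combinations of predictors from a list of predictors. From there, the rest of the solution is the same.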
iv <- c("Sex", "fAge", "Index", "Lane", "Gen")
dv <- paste0('PC', 1:8)
rhs <- unlist(sapply(1:length(iv), function(m) apply(combn(iv, m = m), 2, paste, collapse = ' + ')))
frms <- with(expand.grid(dv, rhs), paste(Var1, Var2, sep = ' ~ '))
models <- lapply(frms, function(x) anova(lm(x, data = mrna.pcs)))
names(models) <- frms
Using combn generates 31 right-hand sides from the 5 given predictors:
[1] "Sex" "fAge"
[3] "Index" "Lane"
[5] "Gen" "Sex + fAge"
[7] "Sex + Index" "Sex + Lane"
[9] "Sex + Gen" "fAge + Index"
[11] "fAge + Lane" "fAge + Gen"
[13] "Index + Lane" "Index + Gen"
[15] "Lane + Gen" "Sex + fAge + Index"
[17] "Sex + fAge + Lane" "Sex + fAge + Gen"
[19] "Sex + Index + Lane" "Sex + Index + Gen"
[21] "Sex + Lane + Gen" "fAge + Index + Lane"
[23] "fAge + Index + Gen" "fAge + Lane + Gen"
[25] "Index + Lane + Gen" "Sex + fAge + Index + Lane"
[27] "Sex + fAge + Index + Gen" "Sex + fAge + Lane + Gen"
[29] "Sex + Index + Lane + Gen" "fAge + Index + Lane + Gen"
[31] "Sex + fAge + Index + Lane + Gen"
Combining these with the 8 dependent variables gives 248 formulas in total.
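The counts can be checked with a couple of lines of arithmetic: the number of non-empty subsets of 5 predictors is the sum of the binomial coefficients, and each subset is paired with each of the 8 PCs. The variable name `n_rhs` is only illustrative.

```r
iv <- c("Sex", "fAge", "Index", "Lane", "Gen")

# Number of combinations of size 1, 2, ..., 5: 5 + 10 + 10 + 5 + 1
n_rhs <- sum(choose(length(iv), seq_along(iv)))
n_rhs        # 31, which equals 2^5 - 1

# One formula per (PC, right-hand side) pair
n_rhs * 8    # 248
```

The count doubles with every extra predictor, which is another reason to heed the warning about fitting every combination.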
Consider lapply over the PCs, combn over all combinations of the non-PC columns, and reformulate for building the formulas.
# RESPONSE AND TRAIT COLUMN VECTORS
PC_cols <- names(mrna.pcs)[grep("PC", names(mrna.pcs))]
traits <- names(mrna.pcs)[-1][grep("PC", names(mrna.pcs)[-1], invert=TRUE)]
# GENERALIZED METHOD TO RUN DYNAMIC MODEL
run_model <- function(PC, traits) {
  fml <- reformulate(traits, response=PC)
  anova(lm(fml, data=mrna.pcs))
}
# NAMED LIST OF ANOVA OBJECTS
# (simplify=FALSE keeps combn from trying to pack the anova tables into an array)
anova_list <- sapply(
  PC_cols,
  function(PC) lapply(
    seq_along(traits),
    function(i) combn(traits, i, FUN=function(t) run_model(PC, t), simplify=FALSE)
  ),
  simplify = FALSE
)
# ACCESS ELEMENTS
anova_list$PC1
anova_list$PC2
anova_list$PC3
...
anova_list$PC8
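To make the two building blocks concrete, here is a small sketch of what reformulate returns and how combn applies a function to each combination. It needs no data; the column names are just the ones from the question.

```r
# reformulate() builds a formula object from character vectors
fml <- reformulate(c("Sex", "fAge"), response = "PC1")
fml              # PC1 ~ Sex + fAge
class(fml)       # "formula"

# combn() with FUN calls the function once per combination;
# simplify = FALSE returns the results as a plain list
pairs <- combn(c("Sex", "fAge", "Index"), 2,
               FUN = function(t) deparse(reformulate(t, response = "PC1")),
               simplify = FALSE)
pairs
```

Building the formula from character vectors avoids pasting strings together by hand, which is why run_model can take any subset of traits unchanged.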
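To debug any problematic independent or dependent variables, print the iteration variables inside the anonymous functions. When an error occurs, check the last items printed.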
anova_list <- sapply(
  PC_cols,
  function(PC) {
    print(PC)                       # current response column
    lapply(
      seq_along(traits),
      function(i) {
        combn(traits, i, simplify = FALSE, FUN = function(t) {
          print(t)                  # current combination of traits
          run_model(PC, t)
        })
      }
    )
  },
  simplify = FALSE
)
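As a complement to printing, one option is to catch the error so that a failing combination is recorded and the loop carries on. The sketch below assumes a hypothetical helper `safe_model` (not part of the answer above) and a made-up data frame `toy` in which Gen has a single level, which makes lm fail.

```r
# Hypothetical helper: returns the anova table, or a string describing the failure
safe_model <- function(PC, traits, data) {
  fml <- reformulate(traits, response = PC)
  tryCatch(
    anova(lm(fml, data = data)),
    error = function(e) paste("failed:", deparse(fml), "-", conditionMessage(e))
  )
}

# Toy data (hypothetical): Gen is a factor with only one level
toy <- data.frame(
  PC1 = c(1.2, -0.4, 0.7, 2.1, -1.3, 0.2),
  Sex = factor(c("F", "M", "F", "M", "F", "M")),
  Gen = factor(rep("8", 6))
)

ok  <- safe_model("PC1", "Sex", toy)   # an anova table
bad <- safe_model("PC1", "Gen", toy)   # a character string naming the formula and error
```

Filtering the resulting list with `is.character` would then show which formulas failed and why.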