How to run multiple t-tests or an ANOVA that controls for multiple variables in R?
I have df1:
Rate Dogs MHI_2018 Points Level AGE65_MORE P_Elderly
1 0.10791173 0.00000000 59338 236.4064 C 8653 15.56267
2 0.06880040 0.00000000 57588 229.4343 C 44571 20.44335
3 0.08644537 0.00000000 50412 200.8446 C 10548 18.23651
4 0.29591635 0.00000000 29267 116.6016 A 1661 16.38390
5 0.05081301 0.00000000 37365 148.8645 B 3995 20.29980
6 0.02625200 0.00000000 45400 180.8765 D 20247 17.71748
7 0.80321285 0.02974862 39917 159.0319 D 6562 19.52105
8 0.07682852 0.00000000 42132 167.8566 D 5980 22.97173
9 0.18118814 0.00000000 47547 189.4303 B 7411 16.78482
10 0.07787555 0.00000000 39907 158.9920 B 2953 22.99665
11 0.15065913 0.00000000 39201 156.1793 C 2751 20.72316
12 0.33362247 0.00000000 46495 185.2390 B 2915 19.45019
13 0.03652168 0.00000000 49055 195.4382 B 10914 19.92988
14 0.27998133 0.00000000 42423 169.0159 A 2481 23.15446
15 0.05407451 0.00000000 40203 160.1713 A 7790 21.06202
16 0.07233796 0.00000000 39057 155.6056 A 2629 19.01765
17 0.08389061 0.00000000 45796 182.4542 B 15446 18.51106
18 0.05220569 0.00000000 34035 135.5976 B 6921 18.06578
19 0.05603418 0.00000000 39491 157.3347 B 12322 17.26133
20 0.15875536 0.00000000 60367 240.5060 C 12400 15.14282
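I want to test whether the mean of Rate differs significantly among the four Level groups (A, B, C, D). I know that with only two groups I could run a t-test, but since there are four groups, I think I could either run 6 t-tests or run an ANOVA. How is the ANOVA run and interpreted?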
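In addition, I would like to see whether the variable P_Elderly is a significant covariate that explains some of the relationship between Level and Rate. If I have other covariates that I want to add later, how would I do that?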
You can start with an ANOVA and then use the TukeyHSD function to get an adjusted p-value for each pairwise comparison:
AOV <- aov(Rate ~ Level, data = df)
AOV
Call:
aov(formula = Rate ~ Level, data = df)
Terms:
Level Residuals
Sum of Squares 0.0916076 0.5068768
Deg. of Freedom 3 16
Residual standard error: 0.1779882
Estimated effects may be unbalanced
TukeyHSD(AOV)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Rate ~ Level, data = df)
$Level
diff lwr upr p adj
B-A -0.066558621 -0.3783957 0.2452784 0.9272012
C-A -0.061063140 -0.4026635 0.2805372 0.9551663
D-A 0.126520253 -0.2624089 0.5154494 0.7890519
C-B 0.005495482 -0.2848090 0.2958000 0.9999404
D-B 0.193078874 -0.1516699 0.5378277 0.4049948
D-C 0.187583392 -0.1843040 0.5594708 0.4923479
Does this answer your question?
Reproducible example
df <- structure(list(Row = 1:20, Rate = c(0.10791173, 0.0688004, 0.08644537,
0.29591635, 0.05081301, 0.026252, 0.80321285, 0.07682852, 0.18118814,
0.07787555, 0.15065913, 0.33362247, 0.03652168, 0.27998133, 0.05407451,
0.07233796, 0.08389061, 0.05220569, 0.05603418, 0.15875536),
Dogs = c(0, 0, 0, 0, 0, 0, 0.02974862, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), MHI_2018 = c(59338L, 57588L, 50412L,
29267L, 37365L, 45400L, 39917L, 42132L, 47547L, 39907L, 39201L,
46495L, 49055L, 42423L, 40203L, 39057L, 45796L, 34035L, 39491L,
60367L), Points = c(236.4064, 229.4343, 200.8446, 116.6016,
148.8645, 180.8765, 159.0319, 167.8566, 189.4303, 158.992,
156.1793, 185.239, 195.4382, 169.0159, 160.1713, 155.6056,
182.4542, 135.5976, 157.3347, 240.506), Level = c("C", "C",
"C", "A", "B", "D", "D", "D", "B", "B", "C", "B", "B", "A",
"A", "A", "B", "B", "B", "C"), AGE65_MORE = c(8653L, 44571L,
10548L, 1661L, 3995L, 20247L, 6562L, 5980L, 7411L, 2953L,
2751L, 2915L, 10914L, 2481L, 7790L, 2629L, 15446L, 6921L,
12322L, 12400L), P_Elderly = c(15.56267, 20.44335, 18.23651,
16.3839, 20.2998, 17.71748, 19.52105, 22.97173, 16.78482,
22.99665, 20.72316, 19.45019, 19.92988, 23.15446, 21.06202,
19.01765, 18.51106, 18.06578, 17.26133, 15.14282)), row.names = c(NA,
-20L), class = c("data.table", "data.frame"))
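The printed aov object above shows sums of squares but no p-value, and the question also asks about running six t-tests. The following is a minimal sketch of both, assuming only base R; the data frame name `dat` and the choice of Holm correction are illustrative assumptions, not part of the original answer.

```r
# Rate and Level copied from the question's data (rows 1-20);
# `dat` is a hypothetical name used only for this sketch
dat <- data.frame(
  Rate = c(0.10791173, 0.0688004, 0.08644537, 0.29591635, 0.05081301,
           0.026252, 0.80321285, 0.07682852, 0.18118814, 0.07787555,
           0.15065913, 0.33362247, 0.03652168, 0.27998133, 0.05407451,
           0.07233796, 0.08389061, 0.05220569, 0.05603418, 0.15875536),
  Level = factor(c("C", "C", "C", "A", "B", "D", "D", "D", "B", "B",
                   "C", "B", "B", "A", "A", "A", "B", "B", "B", "C"))
)

# Overall one-way ANOVA: a single F-test of "all four group means are equal"
aov_fit <- aov(Rate ~ Level, data = dat)
aov_table <- summary(aov_fit)[[1]]  # table with Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(aov_table)

# The "six t-tests" route: all pairwise comparisons with pooled SD,
# p-values adjusted for multiple testing (Holm is an assumption here)
pw <- pairwise.t.test(dat$Rate, dat$Level, p.adjust.method = "holm")
print(pw)
```

A large p-value in the `Pr(>F)` column means there is no evidence that any group mean differs from the others, in which case the pairwise comparisons are unlikely to show anything either.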
You can fit a linear model in which Rate is explained by Level:
fit0 = lm(Rate ~ Level,data=df)
We can look at the coefficients:
coefs = coefficients(fit0)
coefs
(Intercept) LevelB LevelC LevelD
0.17557754 -0.06655862 -0.06106314 0.12652025
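Here A is the reference level, and each coefficient shows how much that group's mean differs from the mean of A. So we can test whether the coefficients for Level B through D are all zero, i.e. whether a single common intercept is sufficient for this model: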
library(car)
linearHypothesis(fit0,names(coefs)[-1],test="F")
Linear hypothesis test
Hypothesis:
LevelB = 0
LevelC = 0
LevelD = 0
Model 1: restricted model
Model 2: Rate ~ Level
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 0.59848
2 16 0.50688 3 0.091608 0.9639 0.4338
This is essentially what an ANOVA does: it tests the significance of all the Level coefficients at once.
anova(fit0)
Analysis of Variance Table
Response: Rate
Df Sum Sq Mean Sq F value Pr(>F)
Level 3 0.09161 0.030536 0.9639 0.4338
Residuals 16 0.50688 0.031680
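To make the link between the coefficients, the group means, and the F-test concrete, here is a small sketch using base R only. The names `dat`, `group_means`, and `fit_null` are assumptions introduced for illustration; the data are copied from the question.

```r
# Rate and Level copied from the question's data (rows 1-20)
dat <- data.frame(
  Rate = c(0.10791173, 0.0688004, 0.08644537, 0.29591635, 0.05081301,
           0.026252, 0.80321285, 0.07682852, 0.18118814, 0.07787555,
           0.15065913, 0.33362247, 0.03652168, 0.27998133, 0.05407451,
           0.07233796, 0.08389061, 0.05220569, 0.05603418, 0.15875536),
  Level = factor(c("C", "C", "C", "A", "B", "D", "D", "D", "B", "B",
                   "C", "B", "B", "A", "A", "A", "B", "B", "B", "C"))
)

fit0 <- lm(Rate ~ Level, data = dat)

# Per-group means as a plain named vector
group_means <- sapply(split(dat$Rate, dat$Level), mean)

# With A as reference, the intercept is the mean of A and each
# LevelX coefficient is mean(X) - mean(A)
diffs_from_A <- group_means[-1] - group_means[["A"]]
print(cbind(coef = coef(fit0)[-1], mean_diff = diffs_from_A))

# Comparing the intercept-only model with the Level model gives
# the same F-test as anova(fit0)
fit_null <- lm(Rate ~ 1, data = dat)
cmp <- anova(fit_null, fit0)
print(cmp)
```

The two columns printed by `cbind()` should agree, and the F statistic in `cmp` should match the one in the ANOVA table above.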
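Given the result above (F = 0.9639, p = 0.4338), the group means are probably not very different. You can also run the pairwise tests like this: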
library(multcomp)
summary(glht(fit0,linfct = mcp(Level = "Tukey")))
For your next question, how to add a covariate, you would fit another model:
fit_full = lm(Rate ~ Level+P_Elderly,data=df)
And compare it with the model that contains only Level:
anova(fit0,fit_full)
Analysis of Variance Table
Model 1: Rate ~ Level
Model 2: Rate ~ Level + P_Elderly
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 0.50688
2 15 0.50150 1 0.0053721 0.1607 0.6942
Again, it seems P_Elderly does not have much of an effect (p = 0.6942).
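The question also asks how to add further covariates later. One possible approach, sketched below with base R only, is to extend the model with `update()` and compare the nested fits. Using MHI_2018 as the extra covariate is purely an illustrative assumption (it is just another column in the question's data), and `dat` and `fit_more` are hypothetical names.

```r
# Columns copied from the question's data (rows 1-20)
dat <- data.frame(
  Rate = c(0.10791173, 0.0688004, 0.08644537, 0.29591635, 0.05081301,
           0.026252, 0.80321285, 0.07682852, 0.18118814, 0.07787555,
           0.15065913, 0.33362247, 0.03652168, 0.27998133, 0.05407451,
           0.07233796, 0.08389061, 0.05220569, 0.05603418, 0.15875536),
  Level = factor(c("C", "C", "C", "A", "B", "D", "D", "D", "B", "B",
                   "C", "B", "B", "A", "A", "A", "B", "B", "B", "C")),
  P_Elderly = c(15.56267, 20.44335, 18.23651, 16.3839, 20.2998, 17.71748,
                19.52105, 22.97173, 16.78482, 22.99665, 20.72316, 19.45019,
                19.92988, 23.15446, 21.06202, 19.01765, 18.51106, 18.06578,
                17.26133, 15.14282),
  MHI_2018 = c(59338, 57588, 50412, 29267, 37365, 45400, 39917, 42132,
               47547, 39907, 39201, 46495, 49055, 42423, 40203, 39057,
               45796, 34035, 39491, 60367)
)

fit0 <- lm(Rate ~ Level, data = dat)
fit_full <- lm(Rate ~ Level + P_Elderly, data = dat)

# Add one more covariate to the existing model (MHI_2018 is an assumption)
fit_more <- update(fit_full, . ~ . + MHI_2018)

# Sequential comparison of the nested models: each row tests the term
# added relative to the previous model
print(anova(fit0, fit_full, fit_more))

# F-test for dropping each term while keeping all the others in the model
print(drop1(fit_more, test = "F"))
```

With only 20 observations, every added covariate costs a residual degree of freedom, so it is worth adding them sparingly.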