What does R assume regarding order in a paired t-test?
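In the non-formula signature of the t.test function, t.test(x, y, paired=T), I assume the data are taken to be paired in the order in which they appear in the two inputs (x and y in the documentation).
However, in the formula signature, t.test(values ~ groups, df, paired=T), how does the function match observations in the two groups into pairs? By order?
In the reprex below, I create a data frame of paired before/after data. I then put it into long format (as the formula signature of t.test expects) in two ways: 1) all "before" rows listed in observation order, followed by all "after" rows in observation order; 2) all rows listed in no particular pairing order.
I then run a paired t-test on both datasets. Clearly, in case 2 the function has no way of knowing which "after" observation goes with which "before" observation. Can I assume that t.test understands the data as entered in case 1, where both the "before" and the "after" rows are in observation order?
I can't find anything about this in the documentation or in any online example. Since there is no argument for a key that links observations across the two groups, t.test must be making some assumption.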
library(tidyverse)
df = data.frame(
  observation = 1:20,
  before = rnorm(20, 10, 2),
  after = rnorm(20, 10.2, 2.3)
)
print.data.frame(df)
#> observation before after
#> 1 1 10.930157 11.818216
#> 2 2 10.870749 10.699232
#> 3 3 9.603120 14.384484
#> 4 4 9.615291 8.777045
#> 5 5 6.714043 9.506421
#> 6 6 9.063117 5.574887
#> 7 7 8.152260 10.357455
#> 8 8 8.256237 8.660646
#> 9 9 12.641977 7.511760
#> 10 10 11.010290 9.391047
#> 11 11 12.545197 9.072856
#> 12 12 12.606526 9.110687
#> 13 13 8.659088 12.445071
#> 14 14 8.958959 10.783168
#> 15 15 11.635443 6.926802
#> 16 16 6.922437 12.419453
#> 17 17 10.326176 10.416757
#> 18 18 7.680960 9.836573
#> 19 19 9.458365 8.083777
#> 20 20 7.235837 12.094290
df_long =
  df %>%
  pivot_longer(
    cols = c("before", "after"),
    names_to = "time",
    values_to = "fabulousness"
  )
print.data.frame(df_long)
#> observation time fabulousness
#> 1 1 before 10.930157
#> 2 1 after 11.818216
#> 3 2 before 10.870749
#> 4 2 after 10.699232
#> 5 3 before 9.603120
#> 6 3 after 14.384484
#> 7 4 before 9.615291
#> 8 4 after 8.777045
#> 9 5 before 6.714043
#> 10 5 after 9.506421
#> 11 6 before 9.063117
#> 12 6 after 5.574887
#> 13 7 before 8.152260
#> 14 7 after 10.357455
#> 15 8 before 8.256237
#> 16 8 after 8.660646
#> 17 9 before 12.641977
#> 18 9 after 7.511760
#> 19 10 before 11.010290
#> 20 10 after 9.391047
#> 21 11 before 12.545197
#> 22 11 after 9.072856
#> 23 12 before 12.606526
#> 24 12 after 9.110687
#> 25 13 before 8.659088
#> 26 13 after 12.445071
#> 27 14 before 8.958959
#> 28 14 after 10.783168
#> 29 15 before 11.635443
#> 30 15 after 6.926802
#> 31 16 before 6.922437
#> 32 16 after 12.419453
#> 33 17 before 10.326176
#> 34 17 after 10.416757
#> 35 18 before 7.680960
#> 36 18 after 9.836573
#> 37 19 before 9.458365
#> 38 19 after 8.083777
#> 39 20 before 7.235837
#> 40 20 after 12.094290
df_long_not_paired =
  df_long %>%
  arrange(fabulousness)
print.data.frame(df_long_not_paired)
#> observation time fabulousness
#> 1 6 after 5.574887
#> 2 5 before 6.714043
#> 3 16 before 6.922437
#> 4 15 after 6.926802
#> 5 20 before 7.235837
#> 6 9 after 7.511760
#> 7 18 before 7.680960
#> 8 19 after 8.083777
#> 9 7 before 8.152260
#> 10 8 before 8.256237
#> 11 13 before 8.659088
#> 12 8 after 8.660646
#> 13 4 after 8.777045
#> 14 14 before 8.958959
#> 15 6 before 9.063117
#> 16 11 after 9.072856
#> 17 12 after 9.110687
#> 18 10 after 9.391047
#> 19 19 before 9.458365
#> 20 5 after 9.506421
#> 21 3 before 9.603120
#> 22 4 before 9.615291
#> 23 18 after 9.836573
#> 24 17 before 10.326176
#> 25 7 after 10.357455
#> 26 17 after 10.416757
#> 27 2 after 10.699232
#> 28 14 after 10.783168
#> 29 2 before 10.870749
#> 30 1 before 10.930157
#> 31 10 before 11.010290
#> 32 15 before 11.635443
#> 33 1 after 11.818216
#> 34 20 after 12.094290
#> 35 16 after 12.419453
#> 36 13 after 12.445071
#> 37 11 before 12.545197
#> 38 12 before 12.606526
#> 39 9 before 12.641977
#> 40 3 after 14.384484
df_long_paired =
  df_long %>%
  arrange(desc(time))
print.data.frame(df_long_paired)
#> observation time fabulousness
#> 1 1 before 10.930157
#> 2 2 before 10.870749
#> 3 3 before 9.603120
#> 4 4 before 9.615291
#> 5 5 before 6.714043
#> 6 6 before 9.063117
#> 7 7 before 8.152260
#> 8 8 before 8.256237
#> 9 9 before 12.641977
#> 10 10 before 11.010290
#> 11 11 before 12.545197
#> 12 12 before 12.606526
#> 13 13 before 8.659088
#> 14 14 before 8.958959
#> 15 15 before 11.635443
#> 16 16 before 6.922437
#> 17 17 before 10.326176
#> 18 18 before 7.680960
#> 19 19 before 9.458365
#> 20 20 before 7.235837
#> 21 1 after 11.818216
#> 22 2 after 10.699232
#> 23 3 after 14.384484
#> 24 4 after 8.777045
#> 25 5 after 9.506421
#> 26 6 after 5.574887
#> 27 7 after 10.357455
#> 28 8 after 8.660646
#> 29 9 after 7.511760
#> 30 10 after 9.391047
#> 31 11 after 9.072856
#> 32 12 after 9.110687
#> 33 13 after 12.445071
#> 34 14 after 10.783168
#> 35 15 after 6.926802
#> 36 16 after 12.419453
#> 37 17 after 10.416757
#> 38 18 after 9.836573
#> 39 19 after 8.083777
#> 40 20 after 12.094290
df_long_not_paired %>%
  t.test(fabulousness ~ time, ., paired=T)
#>
#> Paired t-test
#>
#> data: fabulousness by time
#> t = 2.0289, df = 19, p-value = 0.05672
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.007878376 0.506318062
#> sample estimates:
#> mean of the differences
#> 0.2492198
df_long_paired %>%
  t.test(fabulousness ~ time, ., paired=T)
#>
#> Paired t-test
#>
#> data: fabulousness by time
#> t = 0.3422, df = 19, p-value = 0.736
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.27509 1.77353
#> sample estimates:
#> mean of the differences
#> 0.2492198
Created on 2020-11-24 by the reprex package (v0.3.0)
Note: when I run this repeatedly, I often see false positives from the out-of-order data.
So, to figure out how this works, we can look at the source code. stats:::t.test.formula gives us:
g <- factor(mf[[-response]])
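where mf is the model frame and response is the index of the response variable within it. g is then the grouping variable from the right-hand side (RHS) of your formula.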
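Later, we see that an object DATA is created by splitting the response column of mf on the grouping variable g. This data is then passed to stats:::t.test.default without any change in order.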
DATA <- setNames(split(mf[[response]], g), c("x", "y"))
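To make the "without any change in order" point concrete, here is a minimal sketch of what split() does with shuffled long-format data. The data frame long and its values are made up for illustration and are not from the question.

```r
# Hypothetical long-format data in a deliberately shuffled row order.
long <- data.frame(
  id    = c(2, 1, 3, 1, 3, 2),
  time  = c("before", "after", "before", "before", "after", "after"),
  value = c(20, 11, 30, 10, 31, 21)
)

# split() keeps the elements of each group in their original row order.
# Factor levels are sorted alphabetically, so "after" comes first.
parts <- split(long$value, factor(long$time))
ids   <- split(long$id, factor(long$time))

print(parts)  # after: 11 31 21, before: 20 30 10
print(ids)    # after: 1 3 2,    before: 2 3 1

# Pairing by position would match id 1 (after) with id 2 (before),
# which is not the intended pair.
```

Because the first factor level becomes x, the difference computed for this kind of data is after minus before.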
Then we can look at stats:::t.test.default, focusing on the places where paired is mentioned.
if (paired) {
    x <- x - y
    y <- NULL
}
nx <- length(x)
mx <- mean(x)
vx <- var(x)
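From this we see that t.test.default simply computes the differences between the pairs and then performs a one-sample t-test on those differences.
In summary, the observations must be in the correct order within each group to get the correct pairs.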
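A quick way to check both claims is the sketch below. It uses simulated data with an arbitrary seed, and the variable names are my own. It compares the paired test with a one-sample test on the differences, then breaks the pairing by sorting each vector independently, which is what happens to each group when the long data are sorted by value.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
before <- rnorm(20, 10, 2)
after  <- rnorm(20, 10.2, 2.3)

# A paired test is a one-sample test on the element-wise differences.
paired_res    <- t.test(after, before, paired = TRUE)
onesample_res <- t.test(after - before)
same_t <- isTRUE(all.equal(unname(paired_res$statistic),
                           unname(onesample_res$statistic)))

# Sorting each vector on its own destroys the true pairing.
sorted_res <- t.test(sort(after), sort(before), paired = TRUE)

# The mean difference is identical either way ...
same_mean <- isTRUE(all.equal(unname(paired_res$estimate),
                              unname(sorted_res$estimate)))
# ... but the spread of the differences can only shrink, so |t| grows.
inflated <- abs(sorted_res$statistic) >= abs(paired_res$statistic)

print(c(same_t = same_t, same_mean = same_mean, inflated = unname(inflated)))
```

Pairing the sorted values minimises the variance of the differences among all possible pairings, which would explain why the out-of-order data in the question so often look significant while the mean of the differences stays the same.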
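To add to @BrianLang's explanation of the code: a paired test tests the differences between your samples, and it computes those differences in row order. You can verify this as follows: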
set.seed(111)
df = data.frame(
  observation = 1:20,
  before = rnorm(20, 10, 2),
  after = rnorm(20, 10.2, 2.3)
)
t.test(x=df$after,y=df$before,paired=TRUE)
Paired t-test
data: df$after and df$before
t = 0.30475, df = 19, p-value = 0.7639
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.505079 2.018057
sample estimates:
mean of the differences
0.2564887
If we do it with the long data:
df_long =
  df %>%
  pivot_longer(
    cols = c("before", "after"),
    names_to = "time",
    values_to = "fabulousness"
  )
t.test(fabulousness ~ time,paired=TRUE,data=df_long)
Paired t-test
data: fabulousness by time
t = 0.30475, df = 19, p-value = 0.7639
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.505079 2.018057
sample estimates:
mean of the differences
0.2564887
I would usually use the first form (separate x and y vectors) to avoid all this confusion.
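If the data only exist in long format, one way to get to that first form safely is to pair by an explicit key rather than by row position. The sketch below is a base R illustration under my own assumptions: the data are simulated, and the helper names (wide_truth, wide, res_keyed) are hypothetical. It shuffles the rows, rebuilds the pairs with merge() on the observation column, and then calls the x, y interface.

```r
set.seed(42)  # arbitrary seed for reproducibility
wide_truth <- data.frame(
  observation = 1:20,
  before = rnorm(20, 10, 2),
  after  = rnorm(20, 10.2, 2.3)
)

# Long format with the rows in random order.
long <- data.frame(
  observation  = rep(wide_truth$observation, 2),
  time         = rep(c("before", "after"), each = 20),
  fabulousness = c(wide_truth$before, wide_truth$after)
)
long <- long[sample(nrow(long)), ]

# Rebuild the pairs by key instead of relying on row order.
before <- long[long$time == "before", c("observation", "fabulousness")]
after  <- long[long$time == "after",  c("observation", "fabulousness")]
wide   <- merge(before, after, by = "observation",
                suffixes = c("_before", "_after"))

res_keyed <- t.test(wide$fabulousness_after, wide$fabulousness_before,
                    paired = TRUE)
res_truth <- t.test(wide_truth$after, wide_truth$before, paired = TRUE)

# Both tests use the same pairs, so the statistics agree.
print(all.equal(unname(res_keyed$statistic), unname(res_truth$statistic)))
```

As far as I know, newer R releases (4.4.0 and later) no longer accept paired in the formula method at all, which is one more reason to pair explicitly; please check the t.test help page for your version.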