关于配对 t 检验中的顺序,R 假设什么?

What does R assume regarding order in paired t-test?

t.test 函数 t.test(x, y, paired=T) 的非公式签名中,我假设数据被假定为按照两个输入(文档中的 x 和 y)中的顺序配对。

但是,在公式签名 t.test(values ~ groups, df, paired=T) 中,该函数如何将两组中的观察值成对关联起来?按顺序?

在下面的 reprex 中,我创建了一个包含前后配对数据的数据框。然后我以两种方式将其以长格式(适合 t.test 函数)放置:1) 按观察顺序列出“之前”组,然后按观察顺序列出“之后”组。 2) 不分先后顺序列出所有数据。

I 运行 对两个数据集进行配对 t 检验。很明显,在情况 2 中,函数绝对无法知道哪个“之后”观察与哪个“之前”观察相伴。我可以假设 t.test 函数理解情况 1 中输入的数据,即“之前”和“之后”数据都按观察顺序排列吗?

我在文档或任何在线示例中找不到任何相关信息。因为没有关于链接两组观察的键的参数,所以 t.test 函数正在做某种假设。

library(tidyverse)

df = data.frame(
  observation = 1:20,
  before = rnorm(20, 10, 2),
  after = rnorm(20, 10.2, 2.3)
)

print.data.frame(df)
#>    observation    before     after
#> 1            1 10.930157 11.818216
#> 2            2 10.870749 10.699232
#> 3            3  9.603120 14.384484
#> 4            4  9.615291  8.777045
#> 5            5  6.714043  9.506421
#> 6            6  9.063117  5.574887
#> 7            7  8.152260 10.357455
#> 8            8  8.256237  8.660646
#> 9            9 12.641977  7.511760
#> 10          10 11.010290  9.391047
#> 11          11 12.545197  9.072856
#> 12          12 12.606526  9.110687
#> 13          13  8.659088 12.445071
#> 14          14  8.958959 10.783168
#> 15          15 11.635443  6.926802
#> 16          16  6.922437 12.419453
#> 17          17 10.326176 10.416757
#> 18          18  7.680960  9.836573
#> 19          19  9.458365  8.083777
#> 20          20  7.235837 12.094290

df_long = 
  df %>% 
  pivot_longer(
    cols = c("before", "after"),
    names_to = "time", 
    values_to="fabulousness"
  )

print.data.frame(df_long)
#>    observation   time fabulousness
#> 1            1 before    10.930157
#> 2            1  after    11.818216
#> 3            2 before    10.870749
#> 4            2  after    10.699232
#> 5            3 before     9.603120
#> 6            3  after    14.384484
#> 7            4 before     9.615291
#> 8            4  after     8.777045
#> 9            5 before     6.714043
#> 10           5  after     9.506421
#> 11           6 before     9.063117
#> 12           6  after     5.574887
#> 13           7 before     8.152260
#> 14           7  after    10.357455
#> 15           8 before     8.256237
#> 16           8  after     8.660646
#> 17           9 before    12.641977
#> 18           9  after     7.511760
#> 19          10 before    11.010290
#> 20          10  after     9.391047
#> 21          11 before    12.545197
#> 22          11  after     9.072856
#> 23          12 before    12.606526
#> 24          12  after     9.110687
#> 25          13 before     8.659088
#> 26          13  after    12.445071
#> 27          14 before     8.958959
#> 28          14  after    10.783168
#> 29          15 before    11.635443
#> 30          15  after     6.926802
#> 31          16 before     6.922437
#> 32          16  after    12.419453
#> 33          17 before    10.326176
#> 34          17  after    10.416757
#> 35          18 before     7.680960
#> 36          18  after     9.836573
#> 37          19 before     9.458365
#> 38          19  after     8.083777
#> 39          20 before     7.235837
#> 40          20  after    12.094290

df_long_not_paired = 
  df_long %>% 
  arrange(fabulousness)

print.data.frame(df_long_not_paired)
#>    observation   time fabulousness
#> 1            6  after     5.574887
#> 2            5 before     6.714043
#> 3           16 before     6.922437
#> 4           15  after     6.926802
#> 5           20 before     7.235837
#> 6            9  after     7.511760
#> 7           18 before     7.680960
#> 8           19  after     8.083777
#> 9            7 before     8.152260
#> 10           8 before     8.256237
#> 11          13 before     8.659088
#> 12           8  after     8.660646
#> 13           4  after     8.777045
#> 14          14 before     8.958959
#> 15           6 before     9.063117
#> 16          11  after     9.072856
#> 17          12  after     9.110687
#> 18          10  after     9.391047
#> 19          19 before     9.458365
#> 20           5  after     9.506421
#> 21           3 before     9.603120
#> 22           4 before     9.615291
#> 23          18  after     9.836573
#> 24          17 before    10.326176
#> 25           7  after    10.357455
#> 26          17  after    10.416757
#> 27           2  after    10.699232
#> 28          14  after    10.783168
#> 29           2 before    10.870749
#> 30           1 before    10.930157
#> 31          10 before    11.010290
#> 32          15 before    11.635443
#> 33           1  after    11.818216
#> 34          20  after    12.094290
#> 35          16  after    12.419453
#> 36          13  after    12.445071
#> 37          11 before    12.545197
#> 38          12 before    12.606526
#> 39           9 before    12.641977
#> 40           3  after    14.384484

df_long_paired = 
  df_long %>% 
  arrange(desc(time))

print.data.frame(df_long_paired)
#>    observation   time fabulousness
#> 1            1 before    10.930157
#> 2            2 before    10.870749
#> 3            3 before     9.603120
#> 4            4 before     9.615291
#> 5            5 before     6.714043
#> 6            6 before     9.063117
#> 7            7 before     8.152260
#> 8            8 before     8.256237
#> 9            9 before    12.641977
#> 10          10 before    11.010290
#> 11          11 before    12.545197
#> 12          12 before    12.606526
#> 13          13 before     8.659088
#> 14          14 before     8.958959
#> 15          15 before    11.635443
#> 16          16 before     6.922437
#> 17          17 before    10.326176
#> 18          18 before     7.680960
#> 19          19 before     9.458365
#> 20          20 before     7.235837
#> 21           1  after    11.818216
#> 22           2  after    10.699232
#> 23           3  after    14.384484
#> 24           4  after     8.777045
#> 25           5  after     9.506421
#> 26           6  after     5.574887
#> 27           7  after    10.357455
#> 28           8  after     8.660646
#> 29           9  after     7.511760
#> 30          10  after     9.391047
#> 31          11  after     9.072856
#> 32          12  after     9.110687
#> 33          13  after    12.445071
#> 34          14  after    10.783168
#> 35          15  after     6.926802
#> 36          16  after    12.419453
#> 37          17  after    10.416757
#> 38          18  after     9.836573
#> 39          19  after     8.083777
#> 40          20  after    12.094290


df_long_not_paired %>%
  t.test(fabulousness ~ time, ., paired=T)
#> 
#>  Paired t-test
#> 
#> data:  fabulousness by time
#> t = 2.0289, df = 19, p-value = 0.05672
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.007878376  0.506318062
#> sample estimates:
#> mean of the differences 
#>               0.2492198

df_long_paired %>% 
  t.test(fabulousness ~ time, ., paired=T)
#> 
#>  Paired t-test
#> 
#> data:  fabulousness by time
#> t = 0.3422, df = 19, p-value = 0.736
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -1.27509  1.77353
#> sample estimates:
#> mean of the differences 
#>               0.2492198

reprex package (v0.3.0)

于 2020-11-24 创建

注意:

当我多次运行这样的时候,我经常看到乱序的误报。

所以要弄清楚这是怎么做到的,我们可以看看源代码。

stats:::t.test.formula 给我们:

g <- factor(mf[[-response]])

其中 mf 是模型框架,response 是响应变量。 g 然后是您的公式(LHS)中的分组变量。 然后,稍后,我们看到创建了一个对象DATA,它是mf在分组变量g的基础上的分裂。然后将此数据传递给 stats:::t.test.default,而不更改顺序。

DATA <- setNames(split(mf[[response]], g), c("x", "y"))

然后我们可以查看 stats:::t.test.default,重点关注提到 paired 数据的地方。

if (paired) {
      x <- x - y
      y <- NULL
   }
nx <- length(x)
mx <- mean(x)
vx <- var(x) 

从这里我们看到 t.test.default 只是计算对之间的差异,然后对差异进行单样本 t 检验。

综上所述,我们了解到观察的顺序必须正确才能得到正确的对。

补充一下@BrianLang 对代码的解释,配对测试是测试你的样本之间的差异,它按行顺序计算差异。您可以通过以下方式验证这一点:

set.seed(111)

df = data.frame(
  observation = 1:20,
  before = rnorm(20, 10, 2),
  after = rnorm(20, 10.2, 2.3)
) 

t.test(x=df$after,y=df$before,paired=TRUE)

    Paired t-test

data:  df$after and df$before
t = 0.30475, df = 19, p-value = 0.7639
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.505079  2.018057
sample estimates:
mean of the differences 
              0.2564887 

如果我们用长数据来做:

df_long = 
  df %>% 
  pivot_longer(
    cols = c("before", "after"),
    names_to = "time", 
    values_to="fabulousness"
  )

t.test(fabulousness ~ time,paired=TRUE,data=df_long)

    Paired t-test

data:  fabulousness by time
t = 0.30475, df = 19, p-value = 0.7639
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.505079  2.018057
sample estimates:
mean of the differences 
              0.2564887 

我通常会使用第一个公式来避免所有这些混淆。