Using combn() in R to find all possible t-test relationships, how to access the variables compared?

Question

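So, I have a data frame with a large number of variables, and I want to cross-check every variable against every other one with a t-test.

A sample of my data, called trust_news: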

row	mean	polity2	web	rsf	civil_liberties	freedom_of_expression	vdem_gov_censorship_effort	vdem_self_censorship_effort	vdem_freedom_of_expression	ciri_freedom_of_speech_and_press	media_integrity	vdem_critical_press	vdem_media_perspective	vdem_media_bias	vdem_media_corruption	vdem_media_freedom
1	2.68	8	87.2661	25.69	0.785599008	0.758906967	0.731895466	0.742219428	1	1	0.81449235	0.889046047	0.782079459	0.693825991	0.733503755	1
2	2.8	8	94.8967	22.23	0.810742702	0.832891911	0.8447733	0.831499528	1	1	0.88417386	0.868772592	0.881994928	0.835622928	0.828566864	1
3	3.22	10	89.7391	14.6	0.821268417	0.83327835	0.883343829	0.805721471	1	1	0.829951651	0.917491749	0.725950972	0.709774199	0.874261064	1
5	2.96	10	74.3872	24.98	0.813949794	0.781986225	0.844615869	0.729330399	0.666666667	0.5	0.878769429	0.872387239	0.919019442	0.841939049	0.810193322	0.5

Then I run this code:

trust_news_combos <- combn(trust_news, 1, t.test, simplify = TRUE)

First, is this code correct? I don't know what to put for m in the combn() function. Anyway, that line gave me this:

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16
1	c(t = 85.1670166474227)	c(t = 15.9614095646055)	c(t = 29.2365516170159)	c(t = 11.0778062107689)	c(t = 30.4673329981756)	c(t = 26.8521522144486)	c(t = 23.160185720972)	c(t = 25.1063414199952)	c(t = 17.1830959329723)	c(t = 11.06502519693)	c(t = 33.0841916129404)	c(t = 29.3707961673045)	c(t = 31.2455551028106)	c(t = 39.1490231250879)	c(t = 27.6089179039943)	c(t = 14.0719508946058)
2	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)	c(df = 32)
3	2.69E-39	8.55E-17	1.18E-24	1.75E-12	3.29E-25	1.61E-23	1.46E-21	1.26E-22	1.03E-17	1.80E-12	2.55E-26	1.02E-24	1.51E-25	1.32E-28	6.88E-24	2.96E-15
4	c(3.00189912275063	3.14900996815846)	c(7.56066019283154	9.77267314050179)	c(73.5097801046279	84.5198259559781)	c(19.628297122971	28.4729149982411)	c(0.682586494865725	0.780396107679729)	c(0.639468676034051	0.744449016935646)	c(0.664192511270674	0.792289818305084)	c(0.665160025455844	0.782621785210823)	c(0.676674167771883	0.858679367682662)	c(0.543941635486123	0.78939169784721)	c(0.739756992152986	0.836824222392469)	c(0.730937293702635	0.839876930600395)	c(0.729509614919607	0.831257822777363)	c(0.709894349786553	0.787820841122538)	c(0.708427672557418	0.821287114048642)	c(0.647915673315896	0.867235841835619)
5	c(`mean of x` = 3.07545454545455)	c(`mean of x` = 8.66666666666667)	c(`mean of x` = 79.014803030303)	c(`mean of x` = 24.0506060606061)	c(`mean of x` = 0.731491301272727)	c(`mean of x` = 0.691958846484849)	c(`mean of x` = 0.728241164787879)	c(`mean of x` = 0.723890905333333)	c(`mean of x` = 0.767676767727273)	c(`mean of x` = 0.666666666666667)	c(`mean of x` = 0.788290607272727)	c(`mean of x` = 0.785407112151515)	c(`mean of x` = 0.780383718848485)	c(`mean of x` = 0.748857595454545)	c(`mean of x` = 0.76485739330303)	c(`mean of x` = 0.757575757575758)
6	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)	c(mean = 0)
7	0.036110864	0.542976272	2.702603374	2.171062176	0.024009036	0.025769214	0.031443667	0.028832991	0.044676278	0.0602499	0.023826806	0.02674109	0.024975831	0.019128385	0.027703273	0.053835873
8	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided	two.sided
9	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test	One Sample t-test
10	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]	x[a]

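It gives me the p-values I'm looking for in row 3, but how can I tell which two columns are being compared?

Any help is appreciated, and will be acknowledged in my final code!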

Answer 1

You should write a small function that computes exactly what you need, and use it in place of the standard function t.test. For example:

# get four column names
cols <- names(mtcars)[1:4]   # use trust_news instead of mtcars, and keep all the names

# compute the pval for a pair of names
pval <- function(pair) {
  value <- t.test(mtcars[, pair[1]], mtcars[, pair[2]])$p.value
  names(value) <- paste(pair, collapse = " vs. ")
  value
}

# do it for all pairs.  Don't simplify, and it will keep the names
combn(cols, 2, pval, simplify = FALSE)
#> [[1]]
#>  mpg vs. cyl 
#> 9.507708e-15 
#> 
#> [[2]]
#> mpg vs. disp 
#> 7.978234e-11 
#> 
#> [[3]]
#>   mpg vs. hp 
#> 1.030354e-11 
#> 
#> [[4]]
#> cyl vs. disp 
#> 1.774454e-11 
#> 
#> [[5]]
#>   cyl vs. hp 
#> 8.321996e-13 
#> 
#> [[6]]
#> disp vs. hp 
#> 0.001545647

Created on 2021-05-22 by the reprex package (v2.0.0)
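On the question of what to put for m: m is the number of elements taken per combination, so m = 1 hands t.test one column at a time (hence the "One Sample t-test" rows in the question's output), while m = 2 produces every unordered pair. A minimal sketch of the same idea as above, collecting the pair names and p-values into one table. It uses mtcars as stand-in data, and `pair_mat` and `pvals` are illustrative names, not part of the original answer:

```r
# mtcars stands in for trust_news here; swap in your own data frame.
cols <- names(mtcars)[1:4]

# m = 2: each column of the result is one unordered pair of names.
pair_mat <- combn(cols, 2)
dim(pair_mat)   # 2 rows, choose(4, 2) = 6 columns

# One row per comparison: both variable names plus the Welch t-test p-value.
pvals <- data.frame(
  var1    = pair_mat[1, ],
  var2    = pair_mat[2, ],
  p.value = apply(pair_mat, 2, function(p) {
    t.test(mtcars[[p[1]]], mtcars[[p[2]]])$p.value
  })
)
pvals
```

Keeping the names in their own columns makes it easy to sort or filter the comparisons later, e.g. `pvals[order(pvals$p.value), ]`.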

Answer 2

One option is to run a second combn on the column names:

nm1 <-  combn(names(trust_news), 2, FUN = paste, collapse= '-', simplify = TRUE)

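Then, we do

# test the two columns of each pair against each other; passing t.test directly
# would pool both columns into a single one-sample test
trust_news_combos <- combn(trust_news, 2, function(y) t.test(y[[1]], y[[2]]), simplify = FALSE)
names(trust_news_combos) <- nm1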
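Once the list carries the pair names, each comparison can be looked up by name. A small sketch of that lookup, again assuming mtcars as stand-in data; `df`, `nms`, `tests` and `p` are illustrative names:

```r
# Stand-in data: the first four columns of mtcars.
df <- mtcars[1:4]

# Pair labels and tests are generated in the same combination order,
# so the labels line up with the list elements.
nms   <- combn(names(df), 2, FUN = paste, collapse = "-")
tests <- combn(df, 2, function(y) t.test(y[[1]], y[[2]]), simplify = FALSE)
names(tests) <- nms

# Named vector of p-values, one entry per pair.
p <- sapply(tests, function(x) x$p.value)

p["mpg-cyl"]                 # look up a single comparison by name
tests[["disp-hp"]]$conf.int  # the full htest object is still available
```

Because both combn calls enumerate the pairs in the same order, no extra bookkeeping is needed to match labels to results.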

It may be better to get the output in a data.frame/tibble structure, using tidy from broom (run in R 4.1.0)

library(broom)
lst1 <- combn(trust_news, 2, \(y) t.test(y[1], y[2]) |>
                             tidy(), simplify = FALSE) |>
         setNames(nm1)

out <- Map(cbind, comparison = names(lst1), lst1) |>
       {\(x) do.call(rbind, x)}()
row.names(out) <- NULL

-output

head(out)
                 comparison   estimate estimate1  estimate2   statistic      p.value parameter    conf.low
1                  row-mean  -0.165000      2.75  2.9150000  -0.1914478 0.8599889461  3.112075  -2.8527609
2               row-polity2  -6.250000      2.75  9.0000000  -6.0633906 0.0014638846  5.268737  -8.8595564
3                   row-web -83.822275      2.75 86.5722750 -18.8602012 0.0002049939  3.229641 -97.4140679
4                   row-rsf -19.125000      2.75 21.8750000  -7.1441517 0.0027953086  3.671029 -26.8277783
5       row-civil_liberties   1.942110      2.75  0.8078900   2.2742727 0.1074862571  3.000494  -0.7752796
6 row-freedom_of_expression   1.948234      2.75  0.8017659   2.2809921 0.1067532047  3.002873  -0.7684766
   conf.high                  method alternative
1   2.522761 Welch Two Sample t-test   two.sided
2  -3.640444 Welch Two Sample t-test   two.sided
3 -70.230482 Welch Two Sample t-test   two.sided
4 -11.422222 Welch Two Sample t-test   two.sided
5   4.659500 Welch Two Sample t-test   two.sided
6   4.664945 Welch Two Sample t-test   two.sided

data

trust_news <- structure(list(row = c(1L, 2L, 3L, 5L), mean = c(2.68, 2.8, 3.22, 
2.96), polity2 = c(8L, 8L, 10L, 10L), web = c(87.2661, 94.8967, 
89.7391, 74.3872), rsf = c(25.69, 22.23, 14.6, 24.98), civil_liberties = c(0.785599008, 
0.810742702, 0.821268417, 0.813949794), freedom_of_expression = c(0.758906967, 
0.832891911, 0.83327835, 0.781986225), vdem_gov_censorship_effort = c(0.731895466, 
0.8447733, 0.883343829, 0.844615869), vdem_self_censorship_effort = c(0.742219428, 
0.831499528, 0.805721471, 0.729330399), vdem_freedom_of_expression = c(1, 
1, 1, 0.666666667), ciri_freedom_of_speech_and_press = c(1, 1, 
1, 0.5), media_integrity = c(0.81449235, 0.88417386, 0.829951651, 
0.878769429), vdem_critical_press = c(0.889046047, 0.868772592, 
0.917491749, 0.872387239), vdem_media_perspective = c(0.782079459, 
0.881994928, 0.725950972, 0.919019442), vdem_media_bias = c(0.693825991, 
0.835622928, 0.709774199, 0.841939049), vdem_media_corruption = c(0.733503755, 
0.828566864, 0.874261064, 0.810193322), vdem_media_freedom = c(1, 
1, 1, 0.5)), class = "data.frame", row.names = c(NA, -4L))
