Using combn() in R to find all possible t-test relationships: how do I access the variables compared?
I have a data frame with a large number of variables, and I want to cross-check every variable against the others with t-tests.
A sample of my data, named trust_news:
| row | mean | polity2 | web | rsf | civil_liberties | freedom_of_expression | vdem_gov_censorship_effort | vdem_self_censorship_effort | vdem_freedom_of_expression | ciri_freedom_of_speech_and_press | media_integrity | vdem_critical_press | vdem_media_perspective | vdem_media_bias | vdem_media_corruption | vdem_media_freedom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.68 | 8 | 87.2661 | 25.69 | 0.785599008 | 0.758906967 | 0.731895466 | 0.742219428 | 1 | 1 | 0.81449235 | 0.889046047 | 0.782079459 | 0.693825991 | 0.733503755 | 1 |
| 2 | 2.8 | 8 | 94.8967 | 22.23 | 0.810742702 | 0.832891911 | 0.8447733 | 0.831499528 | 1 | 1 | 0.88417386 | 0.868772592 | 0.881994928 | 0.835622928 | 0.828566864 | 1 |
| 3 | 3.22 | 10 | 89.7391 | 14.6 | 0.821268417 | 0.83327835 | 0.883343829 | 0.805721471 | 1 | 1 | 0.829951651 | 0.917491749 | 0.725950972 | 0.709774199 | 0.874261064 | 1 |
| 5 | 2.96 | 10 | 74.3872 | 24.98 | 0.813949794 | 0.781986225 | 0.844615869 | 0.729330399 | 0.666666667 | 0.5 | 0.878769429 | 0.872387239 | 0.919019442 | 0.841939049 | 0.810193322 | 0.5 |
Then I ran this code:
trust_news_combos <- combn(trust_news, 1, t.test, simplify = TRUE)
First of all, is the code correct? I don't know what to put for m in the combn() function. Anyway, that line gave me this:
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
V11
V12
V13
V14
V15
V16
1
c(t = 85.1670166474227)
c(t = 15.9614095646055)
c(t = 29.2365516170159)
c(t = 11.0778062107689)
c(t = 30.4673329981756)
c(t = 26.8521522144486)
c(t = 23.160185720972)
c(t = 25.1063414199952)
c(t = 17.1830959329723)
c(t = 11.06502519693)
c(t = 33.0841916129404)
c(t = 29.3707961673045)
c(t = 31.2455551028106)
c(t = 39.1490231250879)
c(t = 27.6089179039943)
c(t = 14.0719508946058)
2
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
c(df = 32)
3
2.69E-39
8.55E-17
1.18E-24
1.75E-12
3.29E-25
1.61E-23
1.46E-21
1.26E-22
1.03E-17
1.80E-12
2.55E-26
1.02E-24
1.51E-25
1.32E-28
6.88E-24
2.96E-15
4
c(3.00189912275063
3.14900996815846)
c(7.56066019283154
9.77267314050179)
c(73.5097801046279
84.5198259559781)
c(19.628297122971
28.4729149982411)
c(0.682586494865725
0.780396107679729)
c(0.639468676034051
0.744449016935646)
c(0.664192511270674
0.792289818305084)
c(0.665160025455844
0.782621785210823)
c(0.676674167771883
0.858679367682662)
c(0.543941635486123
0.78939169784721)
c(0.739756992152986
0.836824222392469)
c(0.730937293702635
0.839876930600395)
c(0.729509614919607
0.831257822777363)
c(0.709894349786553
0.787820841122538)
c(0.708427672557418
0.821287114048642)
c(0.647915673315896
0.867235841835619)
5
c(mean of x
= 3.07545454545455)
c(mean of x
= 8.66666666666667)
c(mean of x
= 79.014803030303)
c(mean of x
= 24.0506060606061)
c(mean of x
= 0.731491301272727)
c(mean of x
= 0.691958846484849)
c(mean of x
= 0.728241164787879)
c(mean of x
= 0.723890905333333)
c(mean of x
= 0.767676767727273)
c(mean of x
= 0.666666666666667)
c(mean of x
= 0.788290607272727)
c(mean of x
= 0.785407112151515)
c(mean of x
= 0.780383718848485)
c(mean of x
= 0.748857595454545)
c(mean of x
= 0.76485739330303)
c(mean of x
= 0.757575757575758)
6
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
c(mean = 0)
7
0.036110864
0.542976272
2.702603374
2.171062176
0.024009036
0.025769214
0.031443667
0.028832991
0.044676278
0.0602499
0.023826806
0.02674109
0.024975831
0.019128385
0.027703273
0.053835873
8
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
two.sided
9
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
One Sample t-test
10
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
x[a]
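It gives me the p-values I am looking for in row 3, but how can I tell which two columns are being compared?
Any help is appreciated, and I will credit it in my final code!

You should write a small function that computes exactly what you need, and use it in place of the standard t.test function. For example: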
# get four column names
cols <- names(mtcars)[1:4] # use trust_news instead of mtcars, and keep all the names
# compute the pval for a pair of names
pval <- function(pair) {
value <- t.test(mtcars[, pair[1]], mtcars[, pair[2]])$p.value
names(value) <- paste(pair, collapse = " vs. ")
value
}
# do it for all pairs. Don't simplify, and it will keep the names
combn(cols, 2, pval, simplify = FALSE)
#> [[1]]
#> mpg vs. cyl
#> 9.507708e-15
#>
#> [[2]]
#> mpg vs. disp
#> 7.978234e-11
#>
#> [[3]]
#> mpg vs. hp
#> 1.030354e-11
#>
#> [[4]]
#> cyl vs. disp
#> 1.774454e-11
#>
#> [[5]]
#> cyl vs. hp
#> 8.321996e-13
#>
#> [[6]]
#> disp vs. hp
#> 0.001545647
Created on 2021-05-22 by the reprex package (v2.0.0)
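
On the question of what to put for m: it is the number of elements in each combination, so m = 1 hands each column to the function on its own (which is why the output in the question reports "One Sample t-test"), while m = 2 produces every pair. A minimal sketch, reusing the same four mtcars columns as the example above:

```r
# the same four column names as in the example above
cols <- names(mtcars)[1:4]

# m = 1: one "combination" per column, so nothing is being compared
singles <- combn(cols, 1)
ncol(singles)
#> [1] 4

# m = 2: one column of the result per unordered pair
pairs <- combn(cols, 2)
ncol(pairs)
#> [1] 6

# each column of the matrix holds the two names being compared
pairs[, 1]
#> [1] "mpg" "cyl"
```

With n columns and m = 2 there are choose(n, 2) pairs, so the number of tests grows quickly as columns are added.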
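
One approach is to run a second combn over the column names:
nm1 <- combn(names(trust_news), 2, FUN = paste, collapse = '-', simplify = TRUE)
Then we do:
# pass the two columns explicitly: t.test() on a two-column data frame
# would pool all values into a single one-sample test
trust_news_combos <- combn(trust_news, 2, function(y) t.test(y[[1]], y[[2]]), simplify = FALSE)
names(trust_news_combos) <- nm1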
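
If only the p-values are needed, the same idea can be sketched as a named vector. This is a minimal illustration on four mtcars columns rather than trust_news; the names pair_names and p_values are made up for the example, and each column is passed to t.test explicitly so that a two-sample test is run:

```r
# illustrative data: the first four columns of the built-in mtcars
dat <- mtcars[1:4]

# labels for every pair, in the order combn() generates them
pair_names <- combn(names(dat), 2, FUN = paste, collapse = "-")

# p-value of the two-sample t-test for every pair of columns;
# FUN returns a single number, so combn() simplifies to a numeric vector
p_values <- combn(dat, 2, function(d) t.test(d[[1]], d[[2]])$p.value)

# both calls walk the pairs in the same order, so the labels line up
names(p_values) <- pair_names

# look up one comparison by name
p_values["disp-hp"]
```

Because both combn calls enumerate pairs in the same order, the labels and the results stay aligned without any further bookkeeping.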
It may be better to get the output in a data.frame/tibble structure, using tidy from broom (run in R 4.1.0):
library(broom)
lst1 <- combn(trust_news, 2, \(y) t.test(y[1], y[2]) |>
tidy(), simplify = FALSE) |>
setNames(nm1)
out <- Map(cbind, comparison = names(lst1), lst1) |>
{\(x) do.call(rbind, x)}()
row.names(out) <- NULL
-output
head(out)
comparison estimate estimate1 estimate2 statistic p.value parameter conf.low
1 row-mean -0.165000 2.75 2.9150000 -0.1914478 0.8599889461 3.112075 -2.8527609
2 row-polity2 -6.250000 2.75 9.0000000 -6.0633906 0.0014638846 5.268737 -8.8595564
3 row-web -83.822275 2.75 86.5722750 -18.8602012 0.0002049939 3.229641 -97.4140679
4 row-rsf -19.125000 2.75 21.8750000 -7.1441517 0.0027953086 3.671029 -26.8277783
5 row-civil_liberties 1.942110 2.75 0.8078900 2.2742727 0.1074862571 3.000494 -0.7752796
6 row-freedom_of_expression 1.948234 2.75 0.8017659 2.2809921 0.1067532047 3.002873 -0.7684766
conf.high method alternative
1 2.522761 Welch Two Sample t-test two.sided
2 -3.640444 Welch Two Sample t-test two.sided
3 -70.230482 Welch Two Sample t-test two.sided
4 -11.422222 Welch Two Sample t-test two.sided
5 4.659500 Welch Two Sample t-test two.sided
6 4.664945 Welch Two Sample t-test two.sided
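
If broom is not installed, a similar table can be put together in base R. The following is only a sketch on four mtcars columns; the object name res and the column names var1 and var2 are chosen for illustration, and only the p-value is kept:

```r
# illustrative data: four columns of the built-in mtcars
cols <- names(mtcars)[1:4]

# 2 x 6 character matrix: one column per pair of variable names
pairs <- combn(cols, 2)

# one row per comparison, with both variable names kept alongside the p-value
res <- data.frame(
  var1 = pairs[1, ],
  var2 = pairs[2, ],
  p.value = apply(pairs, 2, function(p) {
    t.test(mtcars[[p[1]]], mtcars[[p[2]]])$p.value
  })
)

# smallest p-values first
res[order(res$p.value), ]

# a single comparison, selected by the variable names
subset(res, var1 == "disp" & var2 == "hp")
```

Keeping the two names in separate columns makes it straightforward to filter for every comparison that involves a given variable.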
data
trust_news <- structure(list(row = c(1L, 2L, 3L, 5L), mean = c(2.68, 2.8, 3.22,
2.96), polity2 = c(8L, 8L, 10L, 10L), web = c(87.2661, 94.8967,
89.7391, 74.3872), rsf = c(25.69, 22.23, 14.6, 24.98), civil_liberties = c(0.785599008,
0.810742702, 0.821268417, 0.813949794), freedom_of_expression = c(0.758906967,
0.832891911, 0.83327835, 0.781986225), vdem_gov_censorship_effort = c(0.731895466,
0.8447733, 0.883343829, 0.844615869), vdem_self_censorship_effort = c(0.742219428,
0.831499528, 0.805721471, 0.729330399), vdem_freedom_of_expression = c(1,
1, 1, 0.666666667), ciri_freedom_of_speech_and_press = c(1, 1,
1, 0.5), media_integrity = c(0.81449235, 0.88417386, 0.829951651,
0.878769429), vdem_critical_press = c(0.889046047, 0.868772592,
0.917491749, 0.872387239), vdem_media_perspective = c(0.782079459,
0.881994928, 0.725950972, 0.919019442), vdem_media_bias = c(0.693825991,
0.835622928, 0.709774199, 0.841939049), vdem_media_corruption = c(0.733503755,
0.828566864, 0.874261064, 0.810193322), vdem_media_freedom = c(1,
1, 1, 0.5)), class = "data.frame", row.names = c(NA, -4L))