Why do these two formulas give two different correlograms?

Question

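I am trying to compute the correlation matrix of a data frame (called "df") that contains both numeric and boolean (TRUE, FALSE) variables and has some missing values.

The data frame looks like this: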

df <- data.frame(
  idcode = c(1:10),
  contract = c("TRUE", "FALSE", "FALSE", "FALSE", NA, NA, "TRUE", "TRUE",
               "FALSE", "TRUE"),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c("FALSE", NA, "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"))

I found two similar alternatives for computing it, but they give me different results:

data.matrix(df) %>% cor(use="pairwise.complete.obs") %>% round(digit=3)

and

model.matrix(~0+., data=df) %>% cor(use="pairwise.complete.obs") %>% round(digit=3)

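Can anyone explain why the two correlation matrices differ, and what the correct way to compute the correlation matrix is in this case?

For example, why is the correlation for the CEO-score pair different?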

Answer 1

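The two functions model.matrix and data.matrix behave differently in several respects, including what happens when there are NA values and how non-numeric variables are handled. See their help pages.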

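By default, model.matrix drops the entire row whenever it contains an NA. data.matrix keeps these rows, so unless the whole row is NA, its non-missing values still contribute to cor(use = "pairwise.complete.obs"). This explains the different correlation coefficients.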
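To make the CEO-score case concrete, here is a minimal sketch using the same data. It computes the correlation once on all rows where both score and CEO are observed (what pairwise deletion does on the data.matrix result) and once on the rows that are complete in every column (what remains after model.matrix drops rows). The variable names pairwise and listwise are only illustrative.

```r
df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

ceo <- as.integer(df$CEO)  # FALSE -> 0, TRUE -> 1, NA stays NA

# Pairwise deletion: only row 2 (CEO is NA) is excluded, leaving 9 rows
pairwise <- cor(df$score, ceo, use = "pairwise.complete.obs")

# Listwise deletion: rows 2, 5 and 6 are excluded, leaving 7 rows
cc <- complete.cases(df)
listwise <- cor(df$score[cc], ceo[cc])

sum(!is.na(ceo))    # rows used by the pairwise computation
sum(cc)             # rows used after dropping incomplete rows
round(pairwise, 3)
round(listwise, 3)
```

Rows 5 and 6 have a missing contract but valid score and CEO values, so they count towards the pairwise result and are lost when whole rows are dropped.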

If you have to use model.matrix, you can set the option to pass NA values through (see the solution here) and deal with the NA values in cor.
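As an alternative to changing the global option, the NA handling can be requested for a single call by building the model frame explicitly. This is only a sketch of that idea, not necessarily what the linked solution does; mf, mm_default and mm_pass are illustrative names.

```r
df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

# Default behaviour (assuming options(na.action = "na.omit")):
# incomplete rows are dropped
mm_default <- model.matrix(~ 0 + ., data = df)

# Build the model frame with na.pass so rows containing NA are kept,
# then hand that frame to model.matrix
mf <- model.frame(~ ., data = df, na.action = na.pass)
mm_pass <- model.matrix(~ 0 + ., data = mf)

nrow(mm_default)
nrow(mm_pass)

# cor() now does the pairwise deletion itself
round(cor(mm_pass, use = "pairwise.complete.obs"), 3)
```

Because model.matrix receives a ready-made model frame, it does not apply the na.action option again, and no global state has to be restored afterwards.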

Data

library(tidyverse)

df <- data.frame(
  idcode = c(1:10),
  contract = c(TRUE,FALSE,FALSE,FALSE,NA,NA,TRUE,TRUE,FALSE,TRUE),
  score = c (1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE,NA,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE))

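Note that logical variables should be coded without quotes (""), although the result looks the same here.

Default behaviour of model.matrix

If there are NA values, model.matrix drops the entire row, whereas data.matrix keeps them. This is due to the default options()$na.action, which is set to na.omit and affects only model.matrix.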

options()$na.action
#[1] "na.omit"

model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2

Behaviour with na.action = "na.pass"

# set na.action options
oldpar <- options()$na.action
options(na.action ="na.pass")

model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 2       2             1            0  5.00      NA
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 5       5            NA           NA  3.00       1
#> 6       6            NA           NA  3.80       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2

Comparing the correlation coefficients

data.matrix(df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>          idcode contract  score    CEO
#> idcode    1.000    0.312  0.177  0.625
#> contract  0.312    1.000 -0.226 -0.354
#> score     0.177   -0.226  1.000  0.548
#> CEO       0.625   -0.354  0.548  1.000

model.matrix(~0 + ., data = df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>               idcode contractFALSE contractTRUE  score CEOTRUE
#> idcode         1.000        -0.312        0.312  0.177   0.625
#> contractFALSE -0.312         1.000       -1.000  0.226   0.354
#> contractTRUE   0.312        -1.000        1.000 -0.226  -0.354
#> score          0.177         0.226       -0.226  1.000   0.548
#> CEOTRUE        0.625         0.354       -0.354  0.548   1.000

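Note that the two functions handle logical variables differently: model.matrix creates two dummy variables for contract and one dummy variable for CEO (see the discussion in the comments on this answer), while data.matrix creates a single binary integer variable for each. This is reflected in the correlation matrices.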
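If one column per logical variable is wanted from model.matrix as well, one option is to keep the intercept in the formula, so that every logical gets a single treatment-coded dummy, and then drop the intercept column. The following is a sketch under that assumption; mf, mm, cor_mm and cor_dm are illustrative names.

```r
df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

# Keep rows with NA, keep the intercept, then remove the "(Intercept)" column
mf <- model.frame(~ ., data = df, na.action = na.pass)
mm <- model.matrix(~ ., data = mf)[, -1]
colnames(mm)

cor_mm <- cor(mm, use = "pairwise.complete.obs")
cor_dm <- cor(data.matrix(df), use = "pairwise.complete.obs")

# The column names differ (contractTRUE vs contract), so compare values only
all.equal(unname(cor_mm), unname(cor_dm))
```

With ~ 0 + ., the first logical variable gets a column for each level; contractFALSE is just 1 - contractTRUE, which is why its correlations are the same as those of contractTRUE with the sign flipped.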

Resetting the default options

options(na.action = oldpar)
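To make sure the option is restored even if something fails in between, the change can be wrapped in a small helper that resets it on exit. with_na_pass is a hypothetical helper name introduced here for illustration; it is not part of base R or of any package.

```r
df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

# Hypothetical helper: evaluate an expression with na.action = "na.pass"
with_na_pass <- function(expr) {
  old <- options(na.action = "na.pass")  # options() returns the previous values
  on.exit(options(old), add = TRUE)      # restore them when the function exits
  force(expr)                            # the argument is evaluated lazily, i.e. here
}

before <- getOption("na.action")
mm <- with_na_pass(model.matrix(~ 0 + ., data = df))
after <- getOption("na.action")

nrow(mm)
identical(before, after)
```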

Session info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.33      magrittr_2.0.1  rlang_0.4.11    fastmap_1.1.0
#>  [5] fansi_0.5.0     stringr_1.4.0   styler_1.5.1    highr_0.9
#>  [9] tools_4.1.1     xfun_0.25       utf8_1.2.2      withr_2.4.2
#> [13] htmltools_0.5.2 ellipsis_0.3.2  yaml_2.2.1      digest_0.6.27
#> [17] tibble_3.1.4    lifecycle_1.0.0 crayon_1.4.1    purrr_0.3.4
#> [21] vctrs_0.3.8     fs_1.5.0        glue_1.4.2      evaluate_0.14
#> [25] rmarkdown_2.10  reprex_2.0.1    stringi_1.7.4   compiler_4.1.1
#> [29] pillar_1.6.2    backports_1.2.1 pkgconfig_2.0.3

Created on 2021-09-19 by the reprex package (v2.0.1)
