dplyr: correlations with NA

xx <- data.frame(group = rep(1:4, each=100), a = rnorm(100) , b = rnorm(100))
xx[c(1,14,33), 'b'] = NA

I am trying to compute correlations by group, but I get an error when NAs are present.

library(dplyr)
xx %>% group_by(group) %>% summarize(COR=cor(a,b,na.rm=TRUE))
    
Error: Problem with `summarise()` column `COR`.
    i `COR = cor(a, b, na.rm = TRUE)`.
    x unused argument (na.rm = TRUE)
    i The error occurred in group 1: group = 1.
    Run `rlang::last_error()` to see where the error occurred.

cor has no na.rm argument; missing values are handled through the use argument instead. According to ?cor, the usage is

cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))

use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
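
To make the difference between these options concrete, here is a minimal base R sketch. The toy vectors x, y, and y_all_na are hypothetical and not part of the question's data; the calls only use cor as documented in ?cor.

```r
# Toy vectors (hypothetical, not from the question) with one missing value in y
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# Default use = "everything": the NA propagates to the result
r_everything <- cor(x, y)

# "complete.obs": incomplete pairs are dropped before computing
r_complete <- cor(x, y, use = "complete.obs")

# When there are no complete pairs at all, "na.or.complete" yields NA,
# whereas "complete.obs" raises an error (caught here with tryCatch)
y_all_na <- rep(NA_real_, 5)
r_na_or_complete <- cor(x, y_all_na, use = "na.or.complete")
err <- tryCatch(cor(x, y_all_na, use = "complete.obs"),
                error = function(e) "error")
```

The last two calls are the case that matters when a whole group is missing: "complete.obs" stops with an error, while "na.or.complete" quietly returns NA.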

library(dplyr)
xx %>%
   group_by(group) %>%
   summarize(COR=cor(a,b, use = "complete.obs"))

-output

# A tibble: 4 × 2
  group   COR
  <int> <dbl>
1     1 0.166
2     2 0.190
3     3 0.190
4     4 0.190

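If a group contains only NA, use "na.or.complete", which returns NA for that group instead of raising an error (shown here with the updated data from the comments, where one group has only NA)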

xx %>%
    group_by(group) %>%
    summarize(COR=cor(a,b, use = "na.or.complete"))
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397 
3     3  0.150 
4     4  0.376 
5     5 NA     

This returns the same result as an if/else condition combined with "complete.obs"

xx %>%
    group_by(group) %>%
    summarize(COR= if(any(complete.cases(a, b)))
     cor(a,b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397 
3     3  0.150 
4     4  0.376 
5     5 NA
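
The updated data from the comments is not reproduced in this thread, so the following is only a sketch under assumptions: yy is a hypothetical data frame with a fifth group whose b values are all NA, and the seed is arbitrary. It checks that the two approaches agree on such data.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: five groups of 100 rows; group 5 has only NA in b
yy <- data.frame(group = rep(1:5, each = 100), a = rnorm(500), b = rnorm(500))
yy[c(1, 14, 33), "b"] <- NA
yy[yy$group == 5, "b"] <- NA

# Approach 1: let cor return NA when a group has no complete pairs
res1 <- yy %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "na.or.complete"))

# Approach 2: guard "complete.obs" with an explicit if/else
res2 <- yy %>%
  group_by(group) %>%
  summarize(COR = if (any(complete.cases(a, b)))
    cor(a, b, use = "complete.obs") else NA_real_)

# Compare the per-group correlations from both approaches
all.equal(res1$COR, res2$COR)
```

Both results are expected to match, with NA for group 5. The "na.or.complete" form is shorter; the if/else form makes the fallback value explicit.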