Correlation of similar variables in R

Question

I have edited the data table slightly.

I want to correlate the variables in my dataset that have similar names:

   A_y  B_y  C_y  A_p  B_p  C_p
1  15   52   32   30   98   56
2  30   99   60   56   46   25
3  10   25   31   20   22   30
     ..........
n  55   23   85   12   34   52

I would like to obtain the correlation for each pair of matching columns, for example:

A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93

How can I do this in R? Is it possible?

Answer 1

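This is really dangerous. The language definition does not define how data.frames with invalid column names behave, and duplicated column names are invalid.

You should restructure your input data. That said, here is one way to work with the input as it stands.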

DF <- read.table(text = "   A  B  C  A  B  C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)

sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
#        A          B          C 
#0.9995544  0.1585501 -0.6004010

#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544
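The advice to restructure the input could be sketched as follows. This is only an illustration, and it assumes that the first occurrence of each duplicated name is the "y" measurement and the second is the "p" measurement (suffixes borrowed from the question). The helper names `occurrence`, `prefixes` and `res` are made up for this example.

```r
# Same values as above, read with duplicated column names.
DF <- read.table(text = "   A  B  C  A  B  C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)

# Number each column within its group of identical names: 1, 1, 1, 2, 2, 2.
occurrence <- ave(seq_along(DF), names(DF), FUN = seq_along)

# Assumption: occurrence 1 is "y" and occurrence 2 is "p".
names(DF) <- paste(names(DF), c("y", "p")[occurrence], sep = "_")

# With unique names, each pair can be addressed explicitly by its prefix.
prefixes <- unique(sub("_.*", "", names(DF)))
res <- sapply(prefixes, function(p) {
  cor(DF[[paste0(p, "_y")]], DF[[paste0(p, "_p")]])
})
res
```

Once the names are unique, the columns can be selected by name, so the code no longer relies on how R handles duplicated column names.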

Answer 2

Here is another base R option:

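# Group the columns by the prefix before "_", correlate each pair,
# then label each result with the names of the two columns involved.
within(
  rev(
    stack(
      Map(
        function(x) do.call(cor, unname(x)),
        # one group label per column, so the pairing does not depend on column order
        split.default(df, gsub("_.*", "", names(df)))
      )
    )
  ),
  ind <- sapply(
    ind,
    function(x) {
      paste0(grep(paste0("^", x, "_"), names(df), value = TRUE),
        collapse = "-"
      )
    }
  )
)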

which gives

      ind     values
1 A_y-A_p  0.9995544
2 B_y-B_p  0.1585501
3 C_y-C_p -0.6004010
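A more compact variant, offered only as a sketch: compute the full cross-correlation matrix between the `_y` columns and the `_p` columns and keep its diagonal. It assumes that every `_y` column has a `_p` counterpart with the same prefix. The names `y_cols`, `p_cols` and `pair_cor` are hypothetical, and the data is rebuilt inline so the block runs on its own.

```r
# Same values as the question's first three rows.
df <- data.frame(
  A_y = c(15, 30, 10), B_y = c(52, 99, 25), C_y = c(32, 60, 31),
  A_p = c(30, 56, 20), B_p = c(98, 46, 22), C_p = c(56, 25, 30)
)

# Columns ending in "_y", and the matching "_p" name for each of them.
y_cols <- grep("_y$", names(df), value = TRUE)
p_cols <- sub("_y$", "_p", y_cols)
stopifnot(all(p_cols %in% names(df)))  # assumption: every pair is complete

# cor(x, y) returns a matrix of all y-by-p correlations;
# the diagonal holds the matched pairs because both sets share one order.
pair_cor <- diag(cor(df[y_cols], df[p_cols]))
names(pair_cor) <- paste(y_cols, p_cols, sep = "-")
pair_cor
```

This computes more correlations than needed (all cross pairs), which is fine for a handful of columns but wasteful for very wide data; the pairwise approach above avoids that.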

Data

df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L), 
    C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L, 
    46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))
