R: creating a comprehensive table of correlations between combinations of columns
Here is my dataset. I'm looking at baseball data.
structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
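I am trying to build a multiple linear regression and decide which predictors to include. The problem is that I think some of these variables will be strongly correlated with one another. For example, one column is "base hits by batters (any type of hit)", another is "doubles by batters", and so on. So I think that if a player hits a double, it gets counted as +1 in several different columns.
I am trying to figure out which variables to include, and one strategy I thought of is to determine which of these variables are correlated with each other and how strongly. Maybe I will leave out variables that are very closely correlated with each other. (Any help on that?)
I started down this path, looking at Pearson correlations one at a time: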
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)
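But then I saw how many pairwise combinations there are among all these variables! There are 16 columns in this data frame and I want to select any two: 16! / (2! (16 - 2)!). If my math is right, that would be 120 lines of code with this approach, and it would be easy to get tangled up and lose track of which ones I have already done... so not very efficient.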
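As a quick check of that count, a minimal sketch using base R's `choose()` and `combn()`. The variable names `n_cols`, `n_pairs`, and `pair_index` are illustrative only, and the figure of 16 columns is taken from the text above (that is, with INDEX excluded).

```r
# Number of columns being compared (16, per the text above)
n_cols <- 16

# Number of unordered pairs: 16! / (2! * (16 - 2)!)
n_pairs <- choose(n_cols, 2)
print(n_pairs)

# combn() enumerates the pairs themselves, one pair of column positions per column
pair_index <- combn(n_cols, 2)
print(dim(pair_index))
```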
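So my initial question is: is there an efficient way in code to compute the full set of correlations between the variables in a data frame?
Then I found a post on Stack Overflow that I think answers my question, but I still can't quite get it to work.
Side note: I also tried to find out which columns contain NA values, in case the NA values make a difference here.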
any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_BATTING_SB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_CS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE
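(Side note: is there a more efficient way to write this is.na code?)
To continue, I am now following the direction of another Stack Overflow answer, a tidy approach that I don't fully understand, but the person who gave the answer seemed smart: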
# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)
moneyball_training_data %>%
select(-INDEX) %>% # remove unnecessary columns
cor() %>% # get all correlations (even ones you don't care about)
data.frame() %>% # save result as a dataframe
mutate(v1 = row.names(.)) %>% # add row names as a column
gather(v2,cor, -v1) %>% # reshape data
filter(f(v1,v2) & v1 != v2)
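But how can the result be just a 3 x 3 data frame? I was expecting something like the diagram below, where each number is the correlation between x and y, with the redundant entries left blank.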
  1  2  3  4  5  6  7
1    12 13 14 15 16 17
2       23 24 25 26 27
3          34 35 36 37
4             45 46 47
5                56 57
6                   67
7
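One likely reason for the small result is the filter in the pipeline above: `grepl(v1, v2)` keeps only pairs where one column name is contained in the other. The sketch below reproduces just that filtering step on the 16 column names from the sample data (INDEX excluded), without computing any correlations. The names `cols`, `grid`, and `kept` are illustrative and not part of the original code.

```r
# Column names from the sample data, with INDEX removed
cols <- c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
          "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO",
          "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP",
          "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
          "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")

# Same helper as in the pipeline: is x found inside y?
f <- Vectorize(function(x, y) grepl(x, y))

# Every ordered pair of column names, as gather() would produce
grid <- expand.grid(v1 = cols, v2 = cols, stringsAsFactors = FALSE)

# Same condition as filter(f(v1, v2) & v1 != v2)
kept <- grid[f(grid$v1, grid$v2) & grid$v1 != grid$v2, ]
print(kept)
```

Only pairs such as TEAM_BATTING_H with TEAM_BATTING_HR survive, because one name is a substring of the other. With three columns (v1, v2, cor) that is consistent with a 3 x 3 result. That filter was presumably written for a differently named dataset, so dropping the `f(v1, v2)` condition would keep all pairs.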
Is this the kind of matrix you were expecting?
df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")
# 1. Output:
# A tibble: 17 x 18
term INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 INDEX NA 0.642 -0.820 -0.291 0.0236 0.0826 0.205
2 TARG~ 0.642 NA -0.685 0.367 -0.373 0.673 0.788
3 TEAM~ -0.820 -0.685 NA 0.192 0.496 -0.449 -0.502
4 TEAM~ -0.291 0.367 0.192 NA -0.0789 0.640 0.653
5 TEAM~ 0.0236 -0.373 0.496 -0.0789 NA -0.752 -0.676
6 TEAM~ 0.0826 0.673 -0.449 0.640 -0.752 NA 0.984
7 TEAM~ 0.205 0.788 -0.502 0.653 -0.676 0.984 NA
8 TEAM~ 0.134 0.401 -0.560 0.377 -0.754 0.864 0.799
9 TEAM~ 0.790 -0.00267 -0.690 -0.356 0.413 -0.528 -0.541
10 TEAM~ 0.874 -0.0332 -0.834 -0.598 0.261 -0.578 -0.623
11 TEAM~ NA NA NA NA NA NA NA
12 TEAM~ -0.662 -0.923 0.733 -0.358 0.448 -0.771 -0.852
13 TEAM~ -0.352 0.308 -0.127 0.661 -0.767 0.891 0.809
14 TEAM~ -0.914 -0.793 0.736 0.0225 0.0863 -0.341 -0.464
15 TEAM~ -0.667 -0.930 0.719 -0.360 0.424 -0.757 -0.842
16 TEAM~ -0.707 -0.925 0.757 -0.314 0.418 -0.733 -0.820
17 TEAM~ 0.0666 0.265 -0.144 -0.583 -0.447 -0.123 -0.150
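If only one entry per pair is wanted, as in the triangular diagram from the question, the matrix can be reduced to its upper triangle. Below is a minimal base R sketch, assuming a four-column subset of the sample data so that it runs on its own. `df_small`, `cm`, `idx`, and `pairs` are illustrative names, and `use = "pairwise.complete.obs"` is one possible choice for handling NA values rather than the only one.

```r
# A small subset of the columns from the question's sample data
df_small <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200),
  TEAM_BATTING_HR = c(13, 190, 137, 96, 102, 92)
)

# Full correlation matrix; pairwise deletion uses every row available for each pair
cm <- cor(df_small, use = "pairwise.complete.obs")

# Positions of the strict upper triangle, so each unordered pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)

# Long table: one row per pair of columns
pairs <- data.frame(
  v1  = rownames(cm)[idx[, "row"]],
  v2  = colnames(cm)[idx[, "col"]],
  cor = cm[idx]
)

# Strongest relationships (positive or negative) first
pairs <- pairs[order(-abs(pairs$cor)), ]
print(pairs)
```

Sorting by absolute correlation makes it easier to spot predictors that may be redundant in the regression. The corrr package also offers `shave()` and `stretch()` for similar reshaping of a `correlate()` result.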
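A quick answer to the side question buried in this post: a more efficient way to find the columns that contain NA values, rather than checking them one by one:
moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))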
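The same check can be sketched in base R without dplyr. The example below assumes a three-column subset of the sample data so that it runs on its own, and `df_na`, `na_counts`, and `cols_with_na` are illustrative names.

```r
# Three columns taken from the question's sample data
df_na <- data.frame(
  TEAM_BATTING_SO  = c(842, 1075, 917, 922, 920, 973),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, 49, 107),
  TEAM_BATTING_HBP = rep(NA_integer_, 6)
)

# Count the NA values in every column at once
na_counts <- colSums(is.na(df_na))
print(na_counts)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)

# Caution: `$` returns NULL for a column name that does not exist,
# so this check is FALSE rather than an error
misspelled_check <- any(is.na(df_na$TEAM_BATTING_SB))
print(misspelled_check)
```

The last lines matter for the one-by-one checks in the question: the sample data has TEAM_BASERUN_SB and TEAM_BASERUN_CS, not TEAM_BATTING_SB and TEAM_BATTING_CS, so those two checks may be reporting FALSE only because the column names do not match.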