How to find correlations in a dataset containing over 350 columns in R
I have a dataset with about 360 measurement types as columns and 200 rows, each with a unique ID.
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| | ID | M1 | M2 | M3 | M4 | M5 | … | M360 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1 | 6F0ZC | 0.068 | 0.0691 | 37.727 | 42.6139 | 41.7356 | … | 44.9293 |
| 2 | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3 | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262 | … | 45.7038 |
| 4 | 6F108 | 0.0685 | 0.069 | 38.3026 | 43.5699 | 42.3 | … | 46.1701 |
| 5 | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6 | 6F19W | 0.0682 | 0.071 | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7 | 6F1A0 | 0.0681 | 0.069 | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8 | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438 | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528 | 44.606 | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
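I am trying to find correlations among these measurements, identify the highly correlated features, and visualize them. With this many columns, I cannot draw a conventional correlation plot (chart.Correlation, corrgram, etc.).
I also tried qgraph, but the measurements end up cluttered in one place and the result is not very intuitive.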
library(qgraph)
qgraph(cor(df[-c(1)], use="pairwise"),
       layout="spring",
       label.cex=0.9,
       minimum = 0.90,
       label.scale=FALSE)
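Is there a good way to visualize this and show how these measurements relate to one another?
As mentioned in the comments, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heatmap in which color represents the correlation coefficient.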
# create artificial dataset - you have this already
set.seed(1) # for reproducible example
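# 180 random columns, 100 rows
df <- matrix(rnorm(180*100),nrow=100)
# duplicate each column (scaled by 2) so columns 2k-1 and 2k are perfectly correlated: 360 columns
df <- do.call(cbind,lapply(1:180,function(i)cbind(df[,i],2*df[,i])))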
# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels=rownames(cor.df))
gg.df <- melt(cor.df,id="x",variable.name="y", value.name="cor")
# tiles colored continuously based on correlation coefficient
ggplot(gg.df, aes(x,y,fill=cor))+
  geom_tile()+
  scale_fill_gradientn(colours=rev(heat.colors(10)))+
  coord_fixed()
# tiles colored by binned correlation coefficient
gg.df$level <- cut(gg.df$cor,breaks=6)
# one color per bin; cut() creates 6 levels, so supply 6 colors
ggplot(gg.df, aes(x,y,fill=level))+
  geom_tile()+
  scale_fill_manual(values=rev(heat.colors(6)))+
  coord_fixed()
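Note the diagonal. This is by design: the artificial data is set up so that each odd-numbered column i and the next column i+1 are perfectly correlated.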
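A heatmap shows the overall pattern, but with 360 columns there are 64,620 distinct pairs, so it can also help to list the strongly correlated pairs as a table. The following is a minimal base R sketch of that idea, not part of the answer above. The data frame `dat`, its six columns, and the 0.9 cutoff are assumptions made for illustration (the cutoff simply mirrors `minimum = 0.90` from the qgraph call); with the real data you would pass the measurement columns, for example `df[-1]`.

```r
# Synthetic stand-in for the real data (assumption): 200 rows, 6 measurements.
# M2 is built from M1 and M4 from M3, so two pairs are strongly correlated.
set.seed(42)
n  <- 200
m1 <- rnorm(n)
m3 <- rnorm(n)
dat <- data.frame(
  M1 = m1,
  M2 = 2 * m1 + rnorm(n, sd = 0.05),
  M3 = m3,
  M4 = -m3 + rnorm(n, sd = 0.05),
  M5 = rnorm(n),
  M6 = rnorm(n)
)

cor.mat <- cor(dat, use = "pairwise.complete.obs")

# Look only at the upper triangle so each pair appears once
# and the diagonal (always 1) is excluded.
threshold <- 0.9  # assumed cutoff
idx <- which(upper.tri(cor.mat) & abs(cor.mat) >= threshold, arr.ind = TRUE)

# One row per strongly correlated pair, strongest first.
high.pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  cor  = cor.mat[idx],
  stringsAsFactors = FALSE
)
high.pairs <- high.pairs[order(-abs(high.pairs$cor)), ]
print(high.pairs)

# Optional: reorder the variables by hierarchical clustering on 1 - |r|,
# so that correlated variables sit next to each other in a heatmap.
hc <- hclust(as.dist(1 - abs(cor.mat)), method = "average")
cor.ordered <- cor.mat[hc$order, hc$order]
```

If the clustering step is useful, `cor.ordered` can be converted to a data frame and melted in place of `cor(df)` in the ggplot code, which tends to gather correlated measurements into visible blocks rather than scattering them across the plot.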