Correlation between gene pairs in two groups with different sample sizes

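I have two groups of patients. Both have the same gene dimension, but the patient (sample) dimension differs. Each sample within a group is a biological replicate.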

sample1 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584, 
4.046886189, 4.00582865, 3.985265177, 3.9869788, 3.995546913, 
4.00582865, 11.75549075, 11.81394311, 11.81826206, 11.76013913, 
11.8408451, 11.83619671, 11.72858876, 11.73755609, 11.78239274, 
11.83619671, 8.647734791, 8.606480387, 8.64648886, 8.607548328, 
8.605946416, 8.646132879, 8.648268762, 8.648090771, 8.647200821, 
8.646132879, 5.359884744, 5.371302287, 5.37638989, 5.357155019, 
5.378375921, 5.381105646, 5.35281111, 5.355168988, 5.366958378, 
5.381105646, 8.805045323, 8.684889613, 8.794736874, 8.693725426, 
8.680471706, 8.791791603, 8.80946323, 8.807990594, 8.800627416, 
8.791791603, 10.87587031, 10.85539252, 10.87095037, 10.85960961, 
10.85328398, 10.86954467, 10.87797885, 10.877276, 10.87376176, 
10.86954467, 5.505422817, 5.530799682, 5.631682175, 5.422577376, 
5.584910836, 5.667756277, 5.451311664, 5.469348715, 5.559533971, 
5.667756277), .Dim = c(10L, 7L), .Dimnames = list(c("patient1", 
"patient2", "patient3", "patient4", 
"patient5", "patient6", "patient7", 
"patient8", "patient9", "patient10"
), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7"
)))

sample2 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584, 
4.046886189, 4.00582865, 3.985265177, 3.9869788, 11.75549075, 
11.81394311, 11.81826206, 11.76013913, 11.8408451, 11.83619671, 
11.72858876, 11.73755609, 8.647734791, 8.606480387, 8.64648886, 
8.607548328, 8.605946416, 8.646132879, 8.648268762, 8.648090771, 
5.359884744, 5.371302287, 5.37638989, 5.357155019, 5.378375921, 
5.381105646, 5.35281111, 5.355168988, 8.805045323, 8.684889613, 
8.794736874, 8.693725426, 8.680471706, 8.791791603, 8.80946323, 
8.807990594, 10.87587031, 10.85539252, 10.87095037, 10.85960961, 
10.85328398, 10.86954467, 10.87797885, 10.877276, 5.505422817, 
5.530799682, 5.631682175, 5.422577376, 5.584910836, 5.667756277, 
5.451311664, 5.469348715), .Dim = c(8L, 7L), .Dimnames = list(
c("patient1", 
"patient2", "patient3", "patient4", 
"patient5", "patient6", "patient7", 
"patient8"), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7")))

Now I want to check the correlation between gene pairs across the two groups:

library(Hmisc)

rcorr(sample1, sample2, type = "spearman")

I get:

Error in cbind(x, y) : number of rows of matrices must match (see arg 2)

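With t(sample) I get the correlation between pairs of patients, but I need the correlation between pairs of genes (as below). What is the problem? Are there statistical points I should consider?

When the two groups have the same number of patients, I get: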

> rcorr(sample1[1:8, ], sample2[1:8,], type="s")
      gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene1 gene2 gene3 gene4 gene5 gene6 gene7
gene1  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36
gene2  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79
gene3 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene4  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90
gene5 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene6 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene7  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00
gene1  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36
gene2  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79
gene3 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene4  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90
gene5 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene6 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene7  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00

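As you can see, the matrix also contains duplicates. Why?

Regarding the last part of your question: rcorr binds the matrices sample1 and sample2 by column and computes the rank correlation coefficients on the combined matrix. If you give the genes in sample1 and sample2 different names, for example: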

colnames(sample1) <- sprintf('sample1.%s',colnames(sample1))
colnames(sample2) <- sprintf('sample2.%s',colnames(sample2)) 

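you will see that the result is a block matrix: the diagonal blocks hold the coefficients within each sample (sample1-sample1 and sample2-sample2), and the off-diagonal blocks hold the coefficients between sample1 and sample2.

rcorr(sample1[1:8,],sample2[1:8,],type='s')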

              sample1.gene1 sample1.gene2 sample1.gene3 sample1.gene4 sample1.gene5 sample1.gene6 sample1.gene7 sample2.gene1 sample2.gene2 sample2.gene3 sample2.gene4
sample1.gene1          1.00          0.81         -1.00          0.67         -1.00         -1.00          0.36          1.00          0.81         -1.00          0.67
sample1.gene2          0.81          1.00         -0.81          0.95         -0.81         -0.81          0.79          0.81          1.00         -0.81          0.95
sample1.gene3         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene4          0.67          0.95         -0.67          1.00         -0.67         -0.67          0.90          0.67          0.95         -0.67          1.00
sample1.gene5         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene6         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene7          0.36          0.79         -0.36          0.90         -0.36         -0.36          1.00          0.36          0.79         -0.36          0.90
sample2.gene1          1.00          0.81         -1.00          0.67         -1.00         -1.00          0.36          1.00          0.81         -1.00          0.67
sample2.gene2          0.81          1.00         -0.81          0.95         -0.81         -0.81          0.79          0.81          1.00         -0.81          0.95
sample2.gene3         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene4          0.67          0.95         -0.67          1.00         -0.67         -0.67          0.90          0.67          0.95         -0.67          1.00
sample2.gene5         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene6         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene7          0.36          0.79         -0.36          0.90         -0.36         -0.36          1.00          0.36          0.79         -0.36          0.90
              sample2.gene5 sample2.gene6 sample2.gene7
sample1.gene1         -1.00         -1.00          0.36
sample1.gene2         -0.81         -0.81          0.79
sample1.gene3          1.00          1.00         -0.36
sample1.gene4         -0.67         -0.67          0.90
sample1.gene5          1.00          1.00         -0.36
sample1.gene6          1.00          1.00         -0.36
sample1.gene7         -0.36         -0.36          1.00
sample2.gene1         -1.00         -1.00          0.36
sample2.gene2         -0.81         -0.81          0.79
sample2.gene3          1.00          1.00         -0.36
sample2.gene4         -0.67         -0.67          0.90
sample2.gene5          1.00          1.00         -0.36
sample2.gene6          1.00          1.00         -0.36
sample2.gene7         -0.36         -0.36          1.00

It just so happens that sample1 (its first 8 patients) and sample2 are identical in your example, which is why all the blocks are equal.
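The block structure can be reproduced with base R alone. The sketch below uses small made-up matrices (`x`, `y` and their values are illustrative assumptions, not the data from the question) and base `cor` on the column-bound matrix in place of `rcorr`, so it mimics the behaviour described above rather than calling Hmisc itself.

```r
# Toy data: 8 "patients" x 3 "genes". y is an exact copy of x,
# just as sample2 equals sample1[1:8, ] in the question.
set.seed(1)
x <- matrix(rnorm(24), nrow = 8,
            dimnames = list(NULL, paste0("s1.gene", 1:3)))
y <- x
colnames(y) <- paste0("s2.gene", 1:3)

# Correlate every column of the combined matrix with every other column.
full <- cor(cbind(x, y), method = "spearman")

# Split the 6 x 6 result into its blocks.
i1 <- 1:3   # columns that came from x
i2 <- 4:6   # columns that came from y
within_x <- full[i1, i1]
within_y <- full[i2, i2]
between  <- full[i1, i2]

# Identical inputs make all blocks numerically equal.
all.equal(unname(within_x), unname(within_y))
all.equal(unname(within_x), unname(between))
```

If `y` held different values from `x`, the three blocks would differ, and `between` would be the part that answers the sample1-sample2 question.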

Update: the sample1-sample2 correlations can be computed with the cor function:

library(reshape2)

# produce all combinations of column indices for sample1 and sample2
z <- expand.grid(s1=1:7,s2=1:7) 

# the result is symmetric here only because sample1[1:8, ] and sample2 are identical,
# so the strict upper-right triangle (s2 < s1) is enough
z <- z[z$s2 < z$s1, ]

# calculate correlations
z$corr <- mapply(function(i, j) cor(sample1[1:8, i], sample2[1:8, j], method = 'spearman'), z$s1, z$s2)

# reshape the result into a triangular matrix
corr.coefs <- dcast(z, s2 ~ s1, value.var = 'corr')
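As an alternative sketch, base `cor` also accepts two matrices and returns the correlation of every column of the first with every column of the second, which yields the whole between-sample block in one call. The toy matrices `x` and `y` below are made-up stand-ins with equal row counts, not the data from the question. When the two groups really differ, this block is generally not symmetric, so keeping only one triangle would discard information.

```r
# Toy data: two groups of 8 "patients" x 5 "genes" with different values.
set.seed(42)
genes <- paste0("gene", 1:5)
x <- matrix(rnorm(40), nrow = 8, dimnames = list(NULL, genes))
y <- matrix(rnorm(40), nrow = 8, dimnames = list(NULL, genes))

# Rows index the genes of x, columns index the genes of y.
cross <- cor(x, y, method = "spearman")

# The same coefficients computed pair by pair, as in the mapply approach.
idx <- expand.grid(i = 1:5, j = 1:5)
by_pair <- mapply(function(i, j) cor(x[, i], y[, j], method = "spearman"),
                  idx$i, idx$j)

# Both routes agree (cross is stored column by column, i varying fastest).
all.equal(as.vector(cross), by_pair)

# For unrelated groups the block is not expected to be symmetric.
isSymmetric(unname(cross))
```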

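You cannot compute the correlation between two vectors of different lengths. Correlation is calculated from paired data, and this holds for every correlation method.

Missing-value imputation and time-series models (such as GARCH) are not recommended here, because you are working with biological data: patterns may differ from patient to patient, and these methods cannot account for all the factors that may change the phenomenon.

I think the best solution is to drop the extra data so that both samples have the same number of patients. R has a built-in function, cor, whose "use" argument can help you ignore NA values.

Here is the link: http://www.statmethods.net/stats/correlations.html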
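A minimal sketch of both options on made-up data. The names `g1` and `g2` and the choice to pair patients by row position are assumptions for illustration; whether patients from two different groups can meaningfully be paired is a question for the study design. Truncating the larger group, or padding the smaller group with NA and letting `use` drop the incomplete pairs, gives the same coefficients.

```r
# Toy data: 10 "patients" in group 1, 8 in group 2, 3 "genes" each.
set.seed(7)
genes <- paste0("gene", 1:3)
g1 <- matrix(rnorm(30), nrow = 10, dimnames = list(NULL, genes))
g2 <- matrix(rnorm(24), nrow = 8,  dimnames = list(NULL, genes))

# Option 1: keep only as many rows as the smaller group has.
n <- min(nrow(g1), nrow(g2))
r_trunc <- cor(g1[seq_len(n), ], g2[seq_len(n), ], method = "spearman")

# Option 2: pad the smaller group with NA rows and let cor()
# skip the incomplete pairs through its `use` argument.
pad <- matrix(NA_real_, nrow = nrow(g1) - nrow(g2), ncol = ncol(g2))
g2_padded <- rbind(g2, pad)
r_na <- cor(g1, g2_padded, method = "spearman",
            use = "pairwise.complete.obs")

# Both options use the same 8 paired rows, so the results agree.
all.equal(r_trunc, r_na, check.attributes = FALSE)
```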