Smart way to loop over all rows in A and correlate with all columns in B

Question

First post, so be gentle ;-)

I have a scenario where I want to correlate all rows of matrix A (approx. 50,000) with all columns of matrix B (approx. 100). I have solved it like this:

output = c()
for( i in 1:nrow(A) ){
    for(j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output = rbind(output,c(rownames(A)[i],colnames(B)[j],
                                myTest$p.value,myTest$estimate))
    }
}

But it is hopelessly slow: it has been running for 30 hours and still has not finished.

There must be a smarter way? :-)

Cheers!

Answer 1

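Your code is slow mainly because of rbind, which creates a new matrix and copies all the data from the previous matrix on every call. As the output grows, each iteration copies more rows, which adds up to a huge overhead.

A simple solution is to create the matrix before the loop and then fill it in: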

output = matrix(0, nrow=nrow(A)*ncol(B), ncol=4)
for(i in 1:nrow(A)){
    for(j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output[(i-1)*ncol(B)+j,] = c(rownames(A)[i],colnames(B)[j],
                                     myTest$p.value,myTest$estimate)
    }
}
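One side effect of filling a matrix with c(name, name, p-value, estimate) is that every cell ends up stored as text, because a matrix holds a single type. A minimal sketch of a variation that keeps the numbers numeric: preallocate one typed vector per column and build a data frame at the end. The small random matrices and the names row_name, col_name, p_value and estimate are made up for illustration only.

```r
# Illustrative data only (assumption): 5 rows in A, 3 columns in B
set.seed(1)
A <- matrix(rnorm(5 * 20), nrow = 5,
            dimnames = list(paste0("Arow", 1:5), NULL))
B <- matrix(rnorm(20 * 3), nrow = 20,
            dimnames = list(NULL, paste0("Bcol", 1:3)))

# Preallocate one vector per output column, each with its own type
n_tests  <- nrow(A) * ncol(B)
row_name <- character(n_tests)
col_name <- character(n_tests)
p_value  <- numeric(n_tests)
estimate <- numeric(n_tests)

k <- 1
for (i in seq_len(nrow(A))) {
  for (j in seq_len(ncol(B))) {
    test <- cor.test(A[i, ], B[, j], method = "spearman")
    row_name[k] <- rownames(A)[i]
    col_name[k] <- colnames(B)[j]
    p_value[k]  <- test$p.value
    estimate[k] <- unname(test$estimate)
    k <- k + 1
  }
}

# Assemble once, after the loop; p.value and estimate stay numeric
output <- data.frame(row = row_name, col = col_name,
                     p.value = p_value, estimate = estimate)
```

With this layout, something like output[output$p.value < 0.05, ] works directly, without converting strings back to numbers.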

Answer 2

OK, so I tried @Math's suggestion and decided to time it using the following code:

# Clear workspace
rm(list = ls())

# Reproducible results
set.seed(42)

# Set dimensions
n1 = 500
n2 = 150
n3 = 100

# Create matrices
A = matrix(rnorm(n1*n2),nrow=n1,ncol=n2)
B = matrix(rnorm(n2*n3),nrow=n2,ncol=n3)

# Assign row/col names
rownames(A)=paste("Arow",seq(1,nrow(A)),sep="")
colnames(A)=paste("Acol",seq(1,ncol(A)),sep="")
rownames(B)=paste("Brow",seq(1,nrow(B)),sep="")
colnames(B)=paste("Bcol",seq(1,ncol(B)),sep="")

# State number of correlations to be performed
cat(paste("Total number of correlations =",nrow(A)*ncol(B),"\n"))

# Test 1 using rbind()
cat("Starting test 1 with rbind()\n")
ptm = proc.time()
output = c()
for( i in 1:nrow(A) ){
    for( j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output = rbind(output,c(rownames(A)[i],colnames(B)[j],
                                myTest$p.value,myTest$estimate))
    }
}
print(proc.time() - ptm)

# Test 2 using pre-built matrix
cat("Starting test 2 with pre-built matrix\n")
ptm = proc.time()
output = matrix(0, nrow=nrow(A)*ncol(B), ncol=4)
count  = 1
for( i in 1:nrow(A) ){
    for( j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output[count,] = c(rownames(A)[i],colnames(B)[j],
                           myTest$p.value,myTest$estimate)
        count = count + 1
    }
}
print(proc.time() - ptm)

Running this code produces the following results:

Total number of correlations = 50000 
Starting test 1 with rbind()
   user  system elapsed 
275.560   6.963 282.913 
Starting test 2 with pre-built matrix
   user  system elapsed 
 29.869   0.218  30.114

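So there is clearly a big difference: roughly 283 seconds elapsed with rbind() versus 30 seconds with the pre-built matrix. I was not aware of this 'problem' with using rbind() to build a matrix incrementally. Thanks @Math for pointing it out! :-)

Cheers!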
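A further option, not used in the answers above, is to drop the loop entirely. This is a hedged sketch, assuming there are no missing values: cor() accepts two matrices and correlates their columns pairwise, so cor(t(A), B, method = "spearman") returns all row-by-column estimates in one call. The p-values here come from the t approximation with n - 2 degrees of freedom, which should agree with cor.test(..., exact = FALSE); by default cor.test may use a different computation for smaller samples without ties, so its p-values can differ slightly. The matrix sizes and the name result are illustrative.

```r
# Illustrative data only (assumption): 50 rows in A, 10 columns in B
set.seed(42)
A <- matrix(rnorm(50 * 30), nrow = 50, ncol = 30)
B <- matrix(rnorm(30 * 10), nrow = 30, ncol = 10)
rownames(A) <- paste0("Arow", seq_len(nrow(A)))
colnames(B) <- paste0("Bcol", seq_len(ncol(B)))

# All Spearman estimates at once: rows of A against columns of B
rho <- cor(t(A), B, method = "spearman")

# Two-sided p-values from the t approximation (no exact test)
n     <- ncol(A)                      # observations per correlation
tstat <- rho * sqrt((n - 2) / (1 - rho^2))
pval  <- 2 * pt(-abs(tstat), df = n - 2)

# Long format, one line per (row of A, column of B) pair.
# as.vector() unrolls a matrix column by column, hence times/each.
result <- data.frame(
  row      = rep(rownames(rho), times = ncol(rho)),
  col      = rep(colnames(rho), each  = nrow(rho)),
  p.value  = as.vector(pval),
  estimate = as.vector(rho)
)
```

Whether the approximate p-values are acceptable depends on the sample size and on ties in the data, so it is worth spot-checking a few pairs against cor.test before relying on this.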
