Efficiently applying conditions to a matrix
I have an integer matrix:
set.seed(1)
counts.mat <- matrix(sample(50,29*10,replace=T),nrow=10,ncol=29)
colnames(counts.mat) <- c("ww.1m_1","ww.1m_2","wm.1m_1","wm.1m_2","wm.1m_3","wn.1m_1","wn.1m_2",
"A_1","A_2","B_1","B_2","C_1","C_2",
"ww.2m_1","ww.2m_2","ww.2m_3","wm.2m_1","wm.2m_2","wn.2m_1","wn.2m_2",
"ww.3m_1","ww.3m_2","ww.3m_3","wm.3m_1","wm.3m_2","wm.3m_3","wn.3m_1","wn.3m_2","wn.3m_3")
Its elements are counts of a particular measurement taken across a set of experiments (3 in this example), which are described in a list of data.frames:
df.list <- list(df1 = data.frame(gt1=c("ww.1m","wm.1m","wn.1m"),kt1=c("A","B","C"),stringsAsFactors=F),
df2 = data.frame(gt2=c("ww.2m","wm.2m","wn.2m"),stringsAsFactors=F),
df3 = data.frame(gt2=c("ww.3m","wm.3m","wn.3m"),stringsAsFactors=F))
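Each column of each data.frame in df.list is a factor of the corresponding experiment, and the values in that column are the factor's levels. The colnames of counts.mat are replicates of these factor levels, named in the format <factor.level>_<replicate>, which ties them back to df.list.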
For example, gt1 in df.list$df1 is a factor with the levels:
"ww.1m" "wm.1m" "wn.1m"
and their respective replicates in counts.mat are:
"ww.1m_1","ww.1m_2","wm.1m_1","wm.1m_2","wm.1m_3","wn.1m_1","wn.1m_2"
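The naming convention above can be inverted with a single regular expression. The following is only a minimal sketch, assuming every column name ends in an underscore followed by digits; the names cols and lvl are introduced here for illustration and are not part of the question.

```r
# Replicate column names for gt1, copied from the question
cols <- c("ww.1m_1", "ww.1m_2", "wm.1m_1", "wm.1m_2", "wm.1m_3", "wn.1m_1", "wn.1m_2")

# Strip the trailing "_<replicate>" suffix to recover each column's factor level
lvl <- sub("_[0-9]+$", "", cols)

# Group the replicate column names by factor level
replicates.by.level <- split(cols, lvl)
print(replicates.by.level)
```

Recovering the level from the column name once, up front, avoids running a pattern match per row and per level later on.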
Given:
min.replicates <- 1
min.counts <- 10
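For each factor (column) in each data.frame in df.list, and for each of its levels, I want to return TRUE or FALSE for every row of counts.mat, according to whether at least min.replicates of that level's replicates have a count of at least min.counts.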
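The result should be a matrix whose number of columns equals the total number of factor levels in df.list and whose number of rows equals the number of rows of counts.mat.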
Here is what I believe is a slow implementation:
res.mat <- do.call(rbind,lapply(1:nrow(counts.mat),function(i){
return(do.call(cbind,lapply(1:length(df.list),function(l){
return(do.call(cbind,lapply(1:ncol(df.list[[l]]),function(j){
return(do.call(cbind,lapply(1:nrow(df.list[[l]]),function(k){
return(length(which(counts.mat[i,which(grepl(paste0(df.list[[l]][k,j],"_\\d+$"),colnames(counts.mat),perl=T))] >= min.counts)) >= min.replicates)
})))
})))
})))
}))
So I am looking for something significantly faster.
I think this does the same thing, and it should be faster...
dfcols <- unlist(df.list) #extract list of columns required as a vector
matcols <- lapply(dfcols,function(x) which(startsWith(colnames(counts.mat),x))) #match to matrix columns
resmat <- sapply(1:length(dfcols),function(i)
apply(counts.mat[,matcols[[i]],drop=FALSE],1,function(y) sum(y>=min.counts) >= min.replicates))
colnames(resmat) <- dfcols #set colnames in output
With the correction from my comment above, and with min.counts set to 30 (with 10, every element is TRUE for your example), this gives...
resmat
ww.1m wm.1m wn.1m A B C ww.2m wm.2m wn.2m ww.3m wm.3m wn.3m
[1,] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[2,] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
[3,] TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
[6,] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
[7,] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[8,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
[9,] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
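A further variation, offered only as a sketch and not as part of the answer above: the threshold comparison can be done once for the whole matrix, and rowSums can then replace the per-row apply. The small matrix and the names levels.vec, col.level and pass are hypothetical, chosen to show that exact matching on the stripped column name keeps a level such as "A" from also picking up "AB_1", which prefix matching with startsWith would do.

```r
set.seed(1)
# Small hypothetical counts matrix; "AB_1" shares a prefix with the "A" level
counts.mat <- matrix(sample(50, 6 * 4, replace = TRUE), nrow = 4)
colnames(counts.mat) <- c("A_1", "A_2", "B_1", "B_2", "B_3", "AB_1")
levels.vec <- c("A", "B", "AB")   # stands in for unlist(df.list)
min.replicates <- 1
min.counts <- 10

# Factor level of every column, recovered by stripping "_<replicate>"
col.level <- sub("_[0-9]+$", "", colnames(counts.mat))

# Threshold the whole matrix once
pass <- counts.mat >= min.counts

# For each level, count passing replicates per row with rowSums
res <- sapply(levels.vec, function(lv) {
  rowSums(pass[, col.level == lv, drop = FALSE]) >= min.replicates
})
print(res)
```

drop = FALSE keeps a single-replicate level as a one-column matrix, so rowSums still works. Because sapply is called on a character vector, the result columns are named after the levels without a separate colnames assignment.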