How to use parallel computing for missRanger in imputation of missing values?
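I am imputing missing values with missRanger, and it takes a long time because I have 1000 variables. I tried to use parallel computing, but it does not make the process any faster. Here is the code: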
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
library(missRanger)
train[1:lengthvar] <- missRanger(train[1:lengthvar], pmm.k = 3, num.trees = 100)
stopCluster(cl)
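I am not sure what to add to this code to make it work.
Here is a basic example of the multi-core concept. It highlights the basic idea rather than addressing the timing issue. In my test runs (with a larger number of columns), the non-parallel version was faster.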
library(doParallel)
library(missRanger)
library(data.table) #Needed for rbindlist at the end
cores=detectCores()
cl <- makeCluster(cores[1])
registerDoParallel(cl)
clusterEvalQ(cl, {library(missRanger)}) #Passing the package missRanger to all the cores
#Create some random columns
A=as.numeric(c(1,2,"",4,5,6,7,8,9,10,11,12,13,"",15,16,17,18,19,20))
B=as.numeric(c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,"",122.3,120.9,122.4,125.7,"",128.2,129.1,121.2,128.4,127.6,125.1))
m = as.data.frame(matrix(0, ncol = 10, nrow = 20))
m[,1:5]=A
m[,6:10]=B
list_num=as.data.frame(seq(1,10,by=1)) #A sequence of column numbers for the different cores to run the function for
#Note that the optimal process would have been to take columns 1:3
#and run it on one core, 4:6 to run it on the 2nd core and so on.
#Function to run on the parallel cores
zzz=function(list_num){
m_new=m[,list_num] #Note the function takes the column number as an argument
m_new=missRanger(m_new[1:length(m_new)], pmm.k = 3, num.trees = 100)
}
clusterExport(cl=cl, list("m"),envir=environment()) #Export your list
zz=parLapply(cl=cl,fun=zzz,X=list_num) #Pass the function and the list of numbers here
zzzz=data.frame(rbindlist(zz)) #rbind the results into a single data frame
stopCluster(cl)
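missRanger is based on ranger, a parallelized random forest implementation in R. The code therefore already runs on all cores, and things like doParallel only make the code clumsy.
Try speeding up the calculations by passing relevant parameters to ranger through the ... argument of missRanger instead, e.g. num.trees = 20 or max.depth = 8.
Disclaimer: I am the author of missRanger.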
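As a minimal sketch of the suggestion above, the call below passes tuning arguments through to ranger. The toy data (iris with values removed by missRanger's generateNA helper), the seed, and the choice of 2 threads are assumptions made for illustration and are not from the original question. num.threads is a ranger argument forwarded through ...; whether lowering or raising it helps will depend on the machine.

```r
library(missRanger)

# Toy data (assumption): iris with roughly 10% of the values in each column set to NA
iris_na <- generateNA(iris, p = 0.1, seed = 1)

# Fewer and shallower trees are cheaper to fit.
# num.trees, max.depth and num.threads are handed on to ranger().
imputed <- missRanger(
  iris_na,
  pmm.k = 3,        # predictive mean matching, as in the question
  num.trees = 20,   # fewer trees than the 100 used in the question
  max.depth = 8,    # limit tree depth
  num.threads = 2,  # assumed value; ranger uses all available cores by default
  seed = 1,
  verbose = 0
)

# Count the missing values left after imputation
sum(is.na(imputed))
```

Because ranger already parallelizes the tree fitting internally, no cluster setup with doParallel or makeCluster is needed for this call.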