How do I replace the bootstrap step in the R package randomForest?
First some background information, which might be more interesting on stats.stackexchange:
In my data analysis I am trying to compare the performance of different machine learning methods on time series data (regression, not classification). So, for example, I trained a boosting model and compared it against a random forest model (R package randomForest).
I use time series data where the explanatory variables are lagged values of other data and of the dependent variable.
For some reason the random forest severely underperforms. One of the problems I could think of is that the random forest performs a sampling step of the training data for each tree. If it does this on time series data, the autoregressive nature of the series is completely removed.
To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap step. This basically means that I cut the training set into k pieces, where k << N, and each k-th piece is in the original order. If I sample these k pieces, I still benefit from the 'randomness' in the random forest, but the time-series nature is largely kept intact.
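To make the idea concrete, here is a minimal sketch of the index resampling I have in mind (blockBootIndex is just an illustrative name, not an existing function):
#sketch: cut 1:N into k contiguous blocks, draw k blocks with replacement, concatenate
blockBootIndex = function(N, k) {
  blocks = split(1:N, cut(1:N, k, labels=FALSE)) #k contiguous blocks, original order
  unlist(sample(blocks, k, replace=TRUE), use.names=FALSE)
}
#e.g. blockBootIndex(12, 3) might give c(5,6,7,8, 1,2,3,4, 5,6,7,8)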
Now my question is:
To achieve this, I would normally copy the existing function and edit the desired step/lines:
randomForest2 <- randomForest
But the randomForest() function seems to be a wrapper for another wrapper of deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function as usual?
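(For reference, the chain can be traced with base R introspection; as far as I can tell the sampling itself is not in the R code but in the compiled routine that the default method calls:)
library(randomForest)
methods(randomForest)             #randomForest.default, randomForest.formula
getAnywhere(randomForest.default) #prints the method body; the per-tree sampling
                                  #happens in the compiled code it calls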
To change the sampling of randomForest(type="regression") directly: learn basic C programming, download randomForest.4.6-10.tar.gz from the CRAN source archive (if Windows, install Rtools; if OSX, install Xcode), install and open RStudio, start a new project, choose package, unpack the ...tar.gz into the folder, look into the src folder, open regrf.c, and check lines 151 and 163. Write a new sampling strategy, press Ctrl+Shift+B occasionally to rebuild/compile the package and overwrite the randomForest library, correct the compilation errors that are reported, test now and then whether the package still works, spend some hours figuring out the old uninformative code, perhaps change the DESCRIPTION file, the NAMESPACE file and a few other references so that the package changes its name to randomForestMod, rebuild, voilà.
Below is a simpler way to do it without changing randomForest. Any trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling mechanism in pure R code. I actually thought it was a bad idea, but for this very naive simulation it in fact gives similar/slightly better performance! Remember not to predict the absolute target value; predict a stationary derivative instead, such as relative or absolute change. If you predict the absolute value, the RF will fall back to predicting that tomorrow is very close to today, which is trivial, useless information.
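As a minimal illustration of the combine mechanism (toy data, just to show that two forests grown on the same features merge into one):
library(randomForest)
X = matrix(rnorm(200), ncol=4); Y = rnorm(50)
rfA = randomForest(X, Y, ntree=25)
rfB = randomForest(X, Y, ntree=25)
rfAB = randomForest::combine(rfA, rfB) #a single forest; rfAB$ntree is 50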
Edited code [22:42 CEST]
library(randomForest)
library(doParallel) #on Linux, the parallel package with mclapply is a good alternative
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix for when the lag reaches before the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like base split for a vector, but keeping contiguous blocks in the original order
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
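#e.g. split2(1:10,splits=3) returns list(1:4, 5:7, 8:10): contiguous blocks, original order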
nBlocks=10 #combine does not support blocks of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
.combine=randomForest::combine,
.packages=("randomForest")) %dopar% {
dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
rf = randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
replace=T,ntree=50)
}
print(rfBlocks)
#predict test, make results table
results = data.frame(predBlock = predict(rfBlocks,newdata=dX.test),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test))
plot(results[,1:2],xlab="predicted change (test set)",
     ylab="true change",
     main="black: block-trained, blue: bootstrap-trained")
points(results[,3:2],col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster
So for me the solution was not to edit the existing randomForest function. Instead, I coded the block bootstrap myself, using the split2 function that Soren H. Welling provided to create the blocks. Once my data was block-bootstrapped, I looked for a package (rpart) that performs just a single regression tree, and aggregated the trees myself.
On my actual data this gives a version that is a slight improvement over normal random forest performance in terms of RMSPE.
For the code below, performance seems to be a coin flip.
Taking Soren's code as a template, it looks something like this:
library(randomForest)
library(doParallel) #on Linux, the parallel package with mclapply is a good alternative
library(rpart)
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix for when the lag reaches before the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like base split for a vector, but keeping contiguous blocks in the original order
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
#create a list of block-wise bootstrapped samples
blockList <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){
  blockList[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train),splits=splits),
      splits,
      replace=T
    )
  )
}
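#each element of blockList holds the resampled row indices (blocks drawn with replacement) for one tree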
#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#fit one regression tree per block-bootstrapped sample; each worker returns its test-set predictions
blockPreds = foreach(aBlock = blockList,
                     .packages=("rpart")) %dopar% {
  dBlock = df1[aBlock,]
  pred = predict( rpart( dy.train ~ ., data=dBlock, method="anova" ), newdata=data.frame(dX.test) )
}
#make results table; rowMeans aggregates the tree-wise predictions into the block-forest prediction
results = data.frame(predBlock = rowMeans(do.call(cbind.data.frame, blockPreds)),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test))
plot(results[,1:2],xlab="predicted change (test set)",
     ylab="true change",
     main="black: block-trained, blue: bootstrap-trained")
points(results[,3:2],col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster
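Note that rpart prunes its trees by default (cp=0.01, minsplit=20), while randomForest grows its regression trees close to full depth (default nodesize=5), so the single trees above are much shallower than the trees they replace; this may be part of why performance is a coin flip here. A variant worth trying (my suggestion, not tested on the data above) is to grow unpruned trees inside the loop:
pred = predict( rpart( dy.train ~ ., data=dBlock, method="anova",
                       control=rpart.control(cp=0, minsplit=2) ), newdata=data.frame(dX.test) )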