R caret: "non-numeric argument to binary operator" in train with qrf

When I run a quantile regression forest model with caret::train, I get the following error: Error in { : task 1 failed - "non-numeric argument to binary operator".

When I set ntree to a higher number (in my reproducible example, ntree = 150), my code runs without an error.

This code

library(caret)
library(quantregForest)

data(segmentationData)

dat <- segmentationData[segmentationData$Case == "Train",]
dat <- dat[1:50,]

# predictors
preds <- dat[,c(5:ncol(dat))]

# convert all to numeric
preds <- data.frame(sapply(preds, function(x) as.numeric(as.character(x))))

# response variable
response <- dat[,4]

# set up error measures
sumfct <- function(data, lev = NULL, model = NULL){
  RMSE <- sqrt(mean((data$pred - data$obs)^2, na.rm = TRUE))
  c(RMSE = RMSE)
}


# specify folds
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
folds_train <- caret::createMultiFolds(y = dat$Cell,
                                       k = 10,
                                       times = 5)

# specify trainControl for tuning mtry with the created multifolds
finalcontrol <- caret::trainControl(search = "grid", method = "repeatedcv", number = 10, repeats = 5, 
                                    index = folds_train, savePredictions = TRUE, summaryFunction = sumfct)

# build grid for tuning mtry
tunegrid <- expand.grid(mtry = c(2, 10, sqrt(ncol(preds)), ncol(preds)/3))

# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, # with ntree = 150 it works
                      metric = "RMSE",
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = TRUE
)

produces the error. The model with my real data has ntree = 10000, but the task still fails. How can I solve this?

Where in the caret source code can I find the condition that raises Error in { : task 1 failed - "non-numeric argument to binary operator"? Which part of the source does the error message come from?

Update: Following StupidWolf's answer, I modified the code for my real data, so it now looks like this:

# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, # with ntree = 150 it works
                      metric = "RMSE",
                      sampsize = ceiling(length(response)*0.4),
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = FALSE
)

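With my real data I still get the error message above, so in the worst case I have to lower sampsize to 0.1*length(response) for the model to be computed successfully. Setting only keep.inbag = FALSE still produces the error. I have up to 1500 predictors, while the number of samples (rows) is only 50 to 60. I still don't understand what exactly causes the error message. I also tried the model without the sampsize argument, but always with keep.inbag = FALSE. The error still occurs. Only setting sampsize very low ensures success.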

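How can I run the model successfully without setting sampsize? I actually want the bootstrap approach with out-of-bag data to train the forest, not an artificial sample size of 40% or 10% of my data set.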

You get the error because you use the option keep.inbag = TRUE. At line 95 of the quantregForest code:

minoob <- min( apply(!is.na(valuesPredict),1,sum))
if(minoob<10) stop("need to increase number of trees for sufficiently many out-of-bag observations")

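So it requires every one of your observations to have at least 10 out-of-bag (OOB) instances in order to keep the out-of-bag predictions. If your real data are very large, the number of trees (ntree) needed to stay on the safe side will be very large as well.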
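To get a feel for why ntree = 30 trips this check while ntree = 150 does not, here is a rough base-R simulation. It is only a sketch: count_oob is a hypothetical helper, n = 45 is an assumed size for one training fold of the 50-row example, and it assumes each tree draws a plain bootstrap sample of n rows with replacement. It does not call quantregForest itself.

```r
# Count, for each observation, in how many trees it is out of bag when
# every tree draws a bootstrap sample of size n with replacement.
# count_oob is a hypothetical helper, not part of any package.
count_oob <- function(n, ntree) {
  oob <- integer(n)
  for (i in seq_len(ntree)) {
    inbag <- sample.int(n, size = n, replace = TRUE)
    is_oob <- !(seq_len(n) %in% inbag)   # TRUE for rows the tree never drew
    oob <- oob + is_oob
  }
  oob
}

set.seed(1)
# n = 45 is an assumption: roughly one training fold of the 50-row example
min_oob_30  <- replicate(200, min(count_oob(n = 45, ntree = 30)))
min_oob_150 <- replicate(200, min(count_oob(n = 45, ntree = 150)))

# Share of simulated forests in which at least one observation has
# fewer than 10 OOB trees, i.e. where the minoob < 10 check would fire
mean(min_oob_30 < 10)
mean(min_oob_150 < 10)
```

Each observation is out of bag in only about a third of the trees, so with 30 trees the least lucky observation almost always falls below 10, while with 150 trees it practically never does.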

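If you are using caret to train the model, keeping the OOB predictions and also setting savePredictions = TRUE seems redundant. In general, the OOB predictions may not be that useful, since you will be predicting on the test folds anyway.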

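Given the size of your data, another option is to adjust sampsize. In randomForest, only sampsize observations are sampled with replacement for each tree. Setting this to a lower value ensures that there are enough OOB observations. For the example given, we can see: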

model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, sampsize=17,
                      metric = "RMSE",
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = TRUE)

model
Quantile Random Forest 

50 samples
57 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 44, 43, 44, 46, 45, 46, ... 
Resampling results across tuning parameters:

  mtry       RMSE    
   2.000000  42.53061
   7.549834  42.72116
  10.000000  43.11533
  19.000000  42.80340

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
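As a back-of-the-envelope check on why sampsize = 17 helps, the expected number of OOB trees per observation can be computed directly. This is a sketch under assumptions: p_oob is a hypothetical helper, n = 45 stands in for one training fold, and sampling is with replacement, so the OOB count of a single observation is treated as binomial across trees.

```r
# Probability that a given observation is out of bag for one tree when
# sampsize rows are drawn with replacement from n rows.
# p_oob is a hypothetical helper, not part of randomForest.
p_oob <- function(n, sampsize) (1 - 1 / n)^sampsize

n     <- 45   # assumed size of one training fold
ntree <- 30

# Expected number of OOB trees per observation
expected_default <- ntree * p_oob(n, sampsize = n)    # default: sampsize = n
expected_small   <- ntree * p_oob(n, sampsize = 17)

# Chance that a single observation ends up with fewer than 10 OOB trees
risk_default <- pbinom(9, size = ntree, prob = p_oob(n, n))
risk_small   <- pbinom(9, size = ntree, prob = p_oob(n, 17))

round(c(expected_default = expected_default, expected_small = expected_small), 1)
signif(c(risk_default = risk_default, risk_small = risk_small), 2)
```

With the default sample size an observation is expected to be out of bag in about 11 of the 30 trees, which leaves little margin above the threshold of 10. With sampsize = 17 the expectation rises to about 20, so the check is far less likely to fire.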