Why does specifying sampsize not speed up randomForest?

Question

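I am trying to run a random forest regression in R on this large dataset using the randomForest package. I have run into problems with the computation time required, even when parallelizing with doSNOW across 10-20 cores. I think I am misunderstanding the "sampsize" argument of the randomForest function. When I subset the dataset to 100,000 rows, I can build one tree in 9-10 seconds.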

library(randomForest)
library(dplyr) # provides sample_n
training = read.csv("training.csv")
t100K = sample_n(training, 100000) # random subset of 100,000 rows
system.time(randomForest(tree~., data=t100K, ntree=1, importance=T)) #~10sec

However, when I instead use the sampsize argument to sample 100,000 rows from the full dataset while running randomForest, the same single tree takes hours.

system.time(randomForest(tree~., data=training, sampsize = ifelse(nrow(training<100000),nrow(training), 100000), ntree=1, importance=T)) #>>100x as long. Why?

Obviously, I will eventually run far more than one tree. What am I missing here? Thanks.

Answer 1

Your parentheses are slightly off. Note the difference between the following statements. You currently have:

ifelse(nrow(mtcars<10),nrow(mtcars), 10)

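This counts the rows of the logical matrix mtcars<10, which holds TRUE for every element of mtcars that is less than 10 and FALSE otherwise. That row count equals nrow(mtcars) and is nonzero, so ifelse treats it as TRUE and always returns nrow(mtcars); in your call, sampsize therefore ends up equal to the full size of training. You want: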

ifelse(nrow(mtcars)<10,nrow(mtcars), 10)
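To make the difference concrete, here is a minimal sketch on the built-in mtcars data (32 rows) with a cap of 10. The variable names wrong, right, and capped are only illustrative, and the min() form is a suggested alternative rather than something from the original code.

```r
# Misplaced parenthesis: mtcars < 10 is a 32 x 11 logical matrix, so nrow() is 32.
# ifelse() coerces the nonzero 32 to TRUE and returns the "yes" value, nrow(mtcars).
wrong <- ifelse(nrow(mtcars < 10), nrow(mtcars), 10)

# Intended comparison: is the row count below the cap?
right <- ifelse(nrow(mtcars) < 10, nrow(mtcars), 10)

# For a scalar cap, min() expresses the same thing and is harder to mistype.
capped <- min(nrow(mtcars), 10)

print(c(wrong = wrong, right = right, capped = capped))
```

Here wrong is 32 (the whole dataset), while right and capped are both 10. Applied to the question, sampsize = min(nrow(training), 100000) would give the intended sample size.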

Hope this helps.

Tags: regression, r, sample, machine-learning, random-forest