通过r中的for循环搜索最佳随机森林参数

Question

大家好，我正在尝试通过for循环搜索最佳参数。然而，结果真的让我很困惑。以下代码应提供相同的结果，因为参数 "mtry" 相同。

       gender Partner   tenure Churn
3521     Male      No 0.992313   Yes
2525.1   Male      No 4.276666    No
567      Male     Yes 2.708050    No
8381   Female      No 4.202127   Yes
6258   Female      No 0.000000   Yes
6569     Male     Yes 2.079442    No
27410  Female      No 1.550804   Yes
6429   Female      No 1.791759   Yes
412    Female     Yes 3.828641    No
4655   Female     Yes 3.737670    No

RFModel = randomForest(Churn ~ .,
                     data = ggg,
                     ntree = 30,
                     mtry = 2,
                     importance = TRUE,
                     replace = FALSE)
print(RFModel$confusion)

    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2

for(i in c(2)){
   RFModel = randomForest(Churn ~ .,
                     data = Trainingds,
                     ntree = 30,
                     mtry = i,
                     importance = TRUE,
                     replace = FALSE)
   print(RFModel$confusion)
}

     No Yes class.error
No   3   2         0.4
Yes  2   3         0.4

Code1 和 code2 应该提供相同的输出。

Answer 1

您每次都会得到略有不同的结果，因为算法内置了随机性。为了构建每棵树，该算法对数据框进行重新采样，并随机选择 mtry 列以从重新采样的数据框构建树。如果您希望使用相同参数（例如 mtry、ntree）构建的模型每次都给出相同的结果，则需要设置随机种子。

例如，让我们运行 randomForest 10 次并检查每个运行的均方误差的平均值。注意每次的均值mse都不一样：

library(randomForest)

replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))

[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021

如果您运行上面的代码，您将得到另外 10 个与上面的值不同的值。

如果您希望能够使用相同的参数（例如，mtry 和 ntree）重现给定模型运行的结果，那么您可以设置随机种子。例如：

set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)

[1] 6.017737

如果您使用相同的种子值，您会得到相同的结果，否则会得到不同的结果。使用较大的 ntree 值将减少但不会消除模型运行s.

之间的可变性

更新： 当我运行你的代码和你提供的数据样本时，我每次都不会得到相同的结果。即使使用 replace=TRUE，这会导致数据帧在没有替换的情况下被采样，每次选择构建树的列都可能不同：

> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 20%
Confusion matrix:
    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2

这是一组与内置 iris 数据框相似的结果：

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          6        44        0.12

您还可以查看每个模型生成的树运行，它们通常会有所不同。例如说我运行以下代码三次，将结果存储在对象 m1、m2 和 m3.

中

randomForest(Churn ~ .,
             data = ggg,
             ntree = 30,
             mtry = 2,
             importance = TRUE,
             replace = FALSE)

现在让我们看看每个模型对象的前四棵树，我已将其粘贴在下面。输出是一个列表。您可以看到每个模型的第一棵树都不同运行。第二棵树与前两个模型运行相同，但第三个不同，依此类推。

check.trees = lapply(1:4, function(i) {
  lapply(list(m1=m1,m2=m2,m3=m3), function(model) getTree(model, i, labelVar=TRUE))
  })

check.trees

[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1       <NA>
2             4              5    gender    1.000000      1       <NA>
3             0              0      <NA>    0.000000     -1         No
4             0              0      <NA>    0.000000     -1        Yes
5             6              7    tenure    2.634489      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    1.850182      1       <NA>
4             0              0      <NA>    0.000000     -1        Yes
5             0              0      <NA>    0.000000     -1         No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1         No


[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1        Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    4.015384      1       <NA>
4             0              0      <NA>    0.000000     -1         No
5             6              7    tenure    4.239396      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

通过r中的for循环搜索最佳随机森林参数

Searching for best random forest parameter by for loop in r

r

machine-learning

random-forest