Make caret's genetic feature selection faster
I am trying to optimize an xgboost tree by using feature selection with caret's genetic algorithm:
results <- gafs(iris[,1:4], iris[,5],
                iters = 2,
                method = "xgbTree",
                metric = "Accuracy",
                gafsControl = gafsControl(functions = caretGA, method = "cv", repeats = 2, verbose = TRUE),
                # note: "trConrol" is spelled as posted; the train() argument is named "trControl"
                trConrol = trainControl(method = "cv", classProbs = TRUE, verboseIter = TRUE)
)
However, this is very slow, even though I am only using iters = 2 rather than iters = 200, which would be more appropriate. What can I do to make it faster?
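Here is an example that parallelizes the gafs() function with the doParallel package and changes a few other parameters to make it faster. Where possible, I include run times.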
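The original code uses cross-validation (method = "cv") rather than repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 argument is ignored. I did not include that option in the parallelized example.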
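If repeated cross-validation was actually intended, a minimal sketch of the two control objects might look like the following. The names ga_ctrl, inner_ctrl and run_search are hypothetical. The sketch also spells the train() argument as trControl; as far as I can tell, the trConrol spelling in the original call is not matched to that argument and falls through "...", so train() would use its default resampling instead. Note that repeats = 2 roughly doubles the external resampling work, so it makes the search slower, not faster.

```r
library(caret)

# Outer (external) resampling for the GA search. With method = "repeatedcv",
# `repeats` is used; with method = "cv" it has no effect.
ga_ctrl <- gafsControl(functions = caretGA,
                       method = "repeatedcv",
                       number = 10,
                       repeats = 2,
                       verbose = TRUE)

# Inner resampling used by train() for each candidate feature subset.
inner_ctrl <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           verboseIter = TRUE)

# Hypothetical wrapper so the (slow) search only runs when called explicitly.
run_search <- function() {
  gafs(iris[, 1:4], iris[, 5],
       iters = 2,
       gafsControl = ga_ctrl,
       # the arguments below are passed on to train()
       method = "xgbTree",
       metric = "Accuracy",
       trControl = inner_ctrl)
}
```

Calling run_search() starts the search; nothing expensive happens until then.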
First, using the original code without any modification or parallelization:
> library(caret)
> data(iris)
> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
                                          iters = 2,
                                          method = "xgbTree",
                                          metric = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method = "cv",
                                                                    repeats = 2,
                                                                    verbose = TRUE),
                                          trConrol = trainControl(method = "cv",  # sic, spelled this way in the run
                                                                  classProbs = TRUE,
                                                                  verboseIter = TRUE)))
Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)
I ran the code above overnight (8 to 10 hours) but stopped it because it was taking too long to finish. A rough estimate of the total run time is at least 24 hours.
Second, with a reduced popSize argument (from 50 to 20), the allowParallel and genParallel options added to gafsControl(), and finally a reduced number of folds (from 10 to 5) in both gafsControl() and trainControl():
> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)
> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
                                          iters = 2,
                                          popSize = 20,
                                          method = "xgbTree",
                                          metric = "Accuracy",
                                          gafsControl = gafsControl(functions = caretGA,
                                                                    method = "cv",
                                                                    number = 5,
                                                                    verbose = TRUE,
                                                                    allowParallel = TRUE,
                                                                    genParallel = TRUE),
                                          trConrol = trainControl(method = "cv",  # sic, spelled this way in the run
                                                                  number = 5,
                                                                  classProbs = TRUE,
                                                                  verboseIter = TRUE)))
final GA
1 0.9508099 (4)
2 0.9508099->0.9561501 (4->1, 25.0%) *
final model
> st.09
    user   system  elapsed
   3.536    0.173 4152.988
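My system has 4 cores but, as specified by detectCores() - 1, only 3 were used, and I confirmed that 3 R processes were running. The elapsed time of 4152.988 seconds is about 69 minutes.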
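To see why shrinking popSize and the number of folds helps, here is a back-of-the-envelope count of model fits. It is only a sketch under stated assumptions: every individual in every generation is scored by a fresh train() call, the GA is run once per external resample plus once on the full training set, and each train() call fits one model per inner resample per tuning candidate plus one final model. The function fit_count and its grid_size parameter are hypothetical; the real xgbTree tuning grid has many candidates, so absolute numbers will be much larger, and the inner resampling is taken to be what the trainControl() calls intend.

```r
# Rough upper bound on the number of model fits in one gafs() call.
fit_count <- function(outer_folds, iters, pop_size, inner_resamples, grid_size = 1) {
  searches <- outer_folds + 1                       # external resamples + final search
  fits_per_train <- inner_resamples * grid_size + 1 # resampled fits + final model
  searches * iters * pop_size * fits_per_train
}

original <- fit_count(outer_folds = 10, iters = 2, pop_size = 50, inner_resamples = 10)
reduced  <- fit_count(outer_folds = 5,  iters = 2, pop_size = 20, inner_resamples = 5)

c(original = original, reduced = reduced, ratio = original / reduced)
```

Under these assumptions the count drops from 12100 to 1440 fits per tuning candidate, roughly 8 times fewer, before any gain from parallelization.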
The gafsControl() documentation describes allowParallel and genParallel as follows:
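allowParallel: if a parallel backend is loaded and available, should the function use it?
genParallel: if a parallel backend is loaded and available, should 'gafs' use it to parallelize the fitness calculations within a generation within a resample?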
The caret documentation suggests that the allowParallel option will give a larger run-time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
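To compare the two options separately, one could build one control object for each, as in the sketch below. The names ctrl_resamples and ctrl_generation are hypothetical, and the worker count of 2 is only for illustration. As I read the documentation, allowParallel distributes the external resamples across workers, while genParallel distributes the fitness calculations within a generation. The sketch also shuts the cluster down afterwards, which the transcripts above do not show.

```r
library(caret)
library(doParallel)

cl <- makePSOCKcluster(2)   # small worker count, for illustration only
registerDoParallel(cl)

# Parallelize over the external resamples only.
ctrl_resamples <- gafsControl(functions = caretGA, method = "cv", number = 5,
                              allowParallel = TRUE, genParallel = FALSE)

# Parallelize the fitness calculations within each generation only.
ctrl_generation <- gafsControl(functions = caretGA, method = "cv", number = 5,
                               allowParallel = FALSE, genParallel = TRUE)

# ... run gafs(..., gafsControl = ctrl_resamples) or ctrl_generation here ...

# Release the workers and return foreach to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Timing the same gafs() call with each control object under system.time() would show which setting matters more on a given machine.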
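I would expect the parallelized code to give at least slightly different results from the original code. Here are the results from the parallelized code: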
> results.09
Genetic Algorithm Feature Selection
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0
Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy
External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)
During resampling:
* the top 4 selected variables (out of a possible 4):
Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
* on average, 1.6 variables were selected (min = 1, max = 4)
In the final search using the entire training set:
* 4 features selected at iteration 1 including:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
* external performance at this iteration is
Accuracy Kappa
0.9467 0.9200
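In the printout above, the final search kept all four features at iteration 1. To work with that selection in code rather than reading it from the printout, a small helper like the one below can pull it from the fitted object. The helper name selected_features is hypothetical; it relies on the optVariables and optIter elements of the object returned by gafs().

```r
# Hypothetical helper: extract the final selection from a fitted gafs object.
selected_features <- function(ga_fit) {
  list(variables = ga_fit$optVariables,  # names of the selected predictors
       iteration = ga_fit$optIter)       # GA iteration that was chosen
}

# Usage with the object from the parallelized run:
# selected_features(results.09)
```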