xgbTree caret matrix or not?
I am running, for example, the code below:
v.ctrl <- trainControl(method = "repeatedcv", repeats = 1, number = 3,
                       summaryFunction = twoClassSummary,
                       classProbs = TRUE,
                       allowParallel = T)
xgb.grid <- expand.grid(nrounds = 10000,
                        eta = c(0.01, 0.05, 0.1),
                        max_depth = c(2, 4, 6, 8, 10, 14))
set.seed(45)
xgb_tune <- train(target ~ .,
                  data = train,
                  method = "xgbTree",
                  trControl = cv.ctrl,
                  tuneGrid = xgb.grid,
                  verbose = TRUE,
                  metric = "LogLoss",
                  nthread = 3)
The error is simply:
Error in train(target ~ ., data = train, method = "xgbTree", trControl = cv.ctrl, :
unused arguments (data = train, method = "xgbTree", trControl = cv.ctrl, tuneGrid = xgb.grid, verbose = T, metric = "LogLoss", nthread = 3)
My dataset:
structure(list(feature19 = c(0.58776, 0.40764, 0.4708, 0.67577, 0.41681, 0.5291, 0.33197, 0.24138, 0.49776, 0.58293), feature6 = c(0.48424, 0.48828, 0.58975, 0.33185, 0.6917, 0.53813, 0.76235, 0.7036, 0.33871, 0.51928), feature10 = c(0.61347, 0.65801, 0.69926, 0.23311, 0.8134, 0.55321, 0.72926, 0.663, 0.49206, 0.55531), feature20 = c(0.39615, 0.49085, 0.50274, 0.6038, 0.37487, 0.53582, 0.62004, 0.63819, 0.37858, 0.40478), feature7 = c(0.55901, 0.38715, 0.50705, 0.76004, 0.3207, 0.54697, 0.31014, 0.21932, 0.4831, 0.52253), feature4 = c(0.5379, 0.52526, 0.44264, 0.28974, 0.65142, 0.41382, 0.44205, 0.47272, 0.6303, 0.56405), feature16 = c(0.41849, 0.45628, 0.37617, 0.39334, 0.46727, 0.36297, 0.3054, 0.41256, 0.6302, 0.41892), feature2 = c(0.62194, 0.5555, 0.61301, 0.27452, 0.74148, 0.49785, 0.5215, 0.46492, 0.54834, 0.58106), feature21 = c(0.32122, 0.37679, 0.35889, 0.74368, 0.18306, 0.47027, 0.40567, 0.47801, 0.41617, 0.35244), feature12 = c(0.56532, 0.55707, 0.49138, 0.24911, 0.69341, 0.42176, 0.41445, 0.45535, 0.62379, 0.5523), target = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L)), .Names = c("feature19", "feature6", "feature10", "feature20", "feature7", "feature4", "feature16", "feature2", "feature21", "feature12", "target"), row.names = c(NA, 10L), class = "data.frame")
Does anyone know whether I have to reprocess the data for xgbTree? Thank you!
I realize I'm a bit of a newbie when it comes to R/caret/machine learning, but while repeatedly checking for answers to my own question I came across your post, and I managed to get your code working. I hope someone more knowledgeable will answer your question fully, but in the meantime, here is what I did.
First, I loaded your dataset into R and tried running your code. I believe there may be a typo in your control function: you assign it to v.ctrl but pass cv.ctrl to train, i.e. you are missing the "c" in "cv", which may be causing your unused-arguments problem.
However, after I fixed that, multiple errors and warnings came up. For one thing, you were using twoClassSummary but specifying logLoss (note the syntax here: it is not LogLoss, in case anything has changed)... Instead, I switched the summaryFunction to mnLogLoss to call the log loss function properly, as I've read that twoClassSummary uses AUC as its metric. Also, I replaced the "target" variable in the training set with a simple character variable, in this case "Y" or "N". You can download the csv file here.
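The conversion itself is one line; a minimal sketch, assuming the dput() data above was read into a data frame called set (my name for it, matching the training call further down):

# map the 0/1 target to valid R factor levels; classProbs = TRUE requires
# class levels that are valid R variable names, hence "Y"/"N" rather than 1/0
set$target <- factor(ifelse(set$target == 1, "Y", "N"), levels = c("N", "Y"))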
After that, I kept getting errors about your tuning grid, saying that you were essentially missing tuning parameters for the xgbTree method, which can be found in the caret documentation (Available Models). I simply added default values for the remaining parameters (most of them 1). The tuning grid I used can be found here.
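The grid isn't shown inline, but based on the held-constant values reported in the output below, it presumably looked something like this (a reconstruction, not the original file):

xgb.grid <- expand.grid(nrounds = 10000,
                        eta = c(0.01, 0.05, 0.1),
                        max_depth = c(2, 4, 6, 8, 10, 14),
                        gamma = 0,             # held constant at 0 in the output
                        colsample_bytree = 1,  # held constant at 1
                        min_child_weight = 1,  # held constant at 1
                        subsample = 1)         # held constant at 1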
The final code I used to actually train the xgb model is as follows:
control = trainControl(method = "repeatedcv", repeats = 1, number = 3,
                       summaryFunction = mnLogLoss,
                       classProbs = TRUE,
                       allowParallel = T)
tune = train(x = set[, 1:10], y = set[, 11], method = "xgbTree", trControl = control,
             tuneGrid = xgb.grid, verbose = TRUE, metric = "logLoss", nthread = 3)
The output looks like this:
tune
eXtreme Gradient Boosting
10 samples
10 predictors
2 classes: 'N', 'Y'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 1 times)
Summary of sample sizes: 6, 8, 6
Resampling results across tuning parameters:
eta    max_depth  logLoss
0.01    2         0.6914816
0.01    4         0.6914816
0.01    6         0.6914816
0.01    8         0.6914816
0.01   10         0.6914816
0.01   14         0.6914816
0.05    2         0.6848399
0.05    4         0.6848399
0.05    6         0.6848399
0.05    8         0.6848399
0.05   10         0.6848399
0.05   14         0.6848399
0.10    2         0.6765847
0.10    4         0.6765847
0.10    6         0.6765847
0.10    8         0.6765847
0.10   10         0.6765847
0.10   14         0.6765847
Tuning parameter 'nrounds' was held constant at a value of 10000
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
logLoss was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 10000, max_depth = 2, eta = 0.1, gamma = 0, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
Hope this helps and is exactly what you were looking for. I'm a little skeptical that I ran the log loss piece correctly, since it looks like max_depth actually has no effect on the log loss. I re-ran the model with a different metric, AUC, and the results showed no effect from anything that was changed; the same went for Cohen's Kappa. I'm guessing this is because there are only ten samples, but hopefully someone can actually explain what I did, so this isn't just a code dump.
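For reference, the AUC re-run mentioned above would look roughly like this (a sketch, not the exact code I ran; note that with twoClassSummary, caret's metric name is "ROC"):

control.auc = trainControl(method = "repeatedcv", repeats = 1, number = 3,
                           summaryFunction = twoClassSummary,  # reports ROC, Sens, Spec
                           classProbs = TRUE,
                           allowParallel = T)
tune.auc = train(x = set[, 1:10], y = set[, 11], method = "xgbTree", trControl = control.auc,
                 tuneGrid = xgb.grid, verbose = TRUE, metric = "ROC", nthread = 3)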