Leave-one-out cross-validation in h2o
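I want to check the results of leave-one-out cross-validation in h2o on my fairly small df. Here is my input df: https://drive.google.com/file/d/1UiIkxlHCq1tJZNOH6hQD30gEMaPdmhgh/view?usp=sharing
Is it possible to set the nfolds parameter in h2o (i.e. nfolds = nrow(df)) to get this kind of cross-validation?
With nrow(df) = 69, I cannot set nfolds to 25 or more.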
u$dc=as.factor(u$dc)
train <- as.h2o(u)
model <- h2o.gbm(x= colnames(train)[1:15],
y="dc", training_frame=train,
nfolds = 25,
learn_rate = 0.06,
ntrees = 90, max_depth = 3,
min_rows = 2,
distribution = "bernoulli")
The code above throws an exception:
Error: water.exceptions.H2OIllegalArgumentException:
Not enough data to create 25 random cross-validation splits. Either reduce nfolds, specify a larger dataset
It is thrown in ModelBuilder.java:
at hex.ModelBuilder.cv_makeWeights(ModelBuilder.java:357)
at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:276)
at hex.ModelBuilder.compute2(ModelBuilder.java:207)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
For the provided dataset, which contains 69 examples, you need to use the following parameters in the h2o.gbm call:
nfolds = 69,
fold_assignment = "Modulo"
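For example, this complete code block runs your example with leave-one-out cross-validation and includes a few extra lines to confirm that the folds were assigned correctly: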
library(h2o)
h2o.init(strict_version_check = FALSE)
u$dc=as.factor(u$dc)
train <- as.h2o(u)
model <- h2o.gbm(x= colnames(train)[1:15],
y="dc", training_frame=train,
nfolds = 69,
fold_assignment = "Modulo",
keep_cross_validation_fold_assignment = TRUE, # keep track of fold assignment to confirm leave-one-out
learn_rate = 0.06,
ntrees = 90, max_depth = 3,
min_rows = 2,
distribution = "bernoulli")
folds <- h2o.cross_validation_fold_assignment(model) # get the fold assigned to each row
print(folds, n = 69) # print the fold assignment for all 69 rows
print(h2o.dim(h2o.unique(folds))) # count the unique fold values (69 rows here means leave-one-out)
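To see why nfolds = 69 with "Modulo" gives leave-one-out folds, here is a minimal base-R sketch. It assumes that "Modulo" assigns each row to the fold given by its row index modulo nfolds, which is how the h2o documentation describes it. The helper modulo_folds is hypothetical and not part of the h2o API; it only mimics that rule so the fold sizes can be checked without starting a cluster. The random draw at the end is likewise only an illustration, not h2o's actual random assignment.

```r
# Hypothetical helper (not an h2o function): assign row i (0-based)
# to fold i mod nfolds, mimicking a modulo-style fold assignment.
modulo_folds <- function(n_rows, nfolds) {
  (seq_len(n_rows) - 1L) %% nfolds
}

n_rows <- 69L  # number of rows in the dataset from the question

# With nfolds equal to the number of rows, every row gets its own fold.
folds_modulo <- modulo_folds(n_rows, nfolds = n_rows)
print(length(unique(folds_modulo)))  # 69 distinct folds
print(max(table(folds_modulo)))      # largest fold holds 1 row

# For comparison: drawing a fold for each row independently at random
# gives no such guarantee, so some folds can end up with no rows at all.
set.seed(1)
folds_random <- sample.int(n_rows, size = n_rows, replace = TRUE) - 1L
print(length(unique(folds_random)))  # number of folds that actually received a row
```

If the first two printed values are 69 and 1, every held-out fold contains exactly one row, which is the same property the h2o.unique check in the answer confirms on the real model.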