h2o.performance predictions differ from h2o.predict?

Question

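Apologies if this has already been answered elsewhere, but I couldn't find anything.

I am using h2o (latest version) in R. I built a random forest model using h2o.grid (for parameter tuning) and called it 'my_rf'.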

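My steps are as follows:

Train a grid of random forests with parameter tuning and cross-validation (nfolds = 5)
Get the grid of models sorted by AUC and set my_rf = best model
Use h2o.performance(my_rf, test) to evaluate AUC, accuracy, etc. on the test set
Use h2o.predict to predict on the test set and export the results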

The exact line I use for h2o.performance is:

h2o.performance(my_rf, newdata = as.h2o(test))

...which gives me a confusion matrix from which I can calculate accuracy (as well as AUC, max F1 score, etc.)

I would have thought that, using

h2o.predict(my_rf, newdata = as.h2o(test))

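I would be able to replicate the confusion matrix from h2o.performance. But the accuracy is different: about 3% worse, in fact.

Can anyone explain why this is so?

Also, is there any way to return the predictions that make up the confusion matrix in h2o.performance?

Edit: here is the relevant code: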

library(mlbench)
data(Sonar)
head(Sonar)

mainset <- Sonar
mainset$Class <- ifelse(mainset$Class == "M", 0,1)          #binarize
mainset$Class <- as.factor(mainset$Class)

response <- "Class"
predictors <- setdiff(names(mainset), c(response, "name"))

# split into training and test set

library(caTools)
set.seed(123)
split = sample.split(mainset[,61], SplitRatio = 0.75)
train = subset(mainset, split == TRUE)
test =  subset(mainset, split == FALSE)

# connect to h2o

Sys.unsetenv("http_proxy")
Sys.setenv(JAVA_HOME = 'C:\\Program Files (x86)\\Java\\jre7')          # set JAVA home for 32 bit (backslashes must be escaped in R strings)
library(h2o)
h2o.init(nthreads = -1)

# random forest grid search

nfolds <- 5
ntrees_opts <- c(20:500)             
max_depth_opts <- c(4,8,12,16,20)
sample_rate_opts <- seq(0.3,1,0.05)
col_sample_rate_opts <- seq(0.3,1,0.05)

rf_hypers <- list(ntrees = ntrees_opts, max_depth = max_depth_opts,
                  sample_rate = sample_rate_opts,
                  col_sample_rate_per_tree = col_sample_rate_opts)

search_criteria <- list(strategy = 'RandomDiscrete', max_runtime_secs = 240, max_models = 15,
stopping_metric = "AUTO", stopping_tolerance = 0.00001, stopping_rounds = 5,seed = 1)

my_rf <- h2o.grid("randomForest", grid_id = "rf_grid", x = predictors, y = response,
                  training_frame = as.h2o(train),
                  nfolds = 5,
                  fold_assignment = "Modulo",
                  keep_cross_validation_predictions = TRUE,
                  hyper_params = rf_hypers,
                  search_criteria = search_criteria)

get_grid_rf <- h2o.getGrid(grid_id = "rf_grid", sort_by = "auc", decreasing = TRUE)                         # get grid of models built
my_rf <- h2o.getModel(get_grid_rf@model_ids[[1]])
perf_rf <- h2o.performance(my_rf, newdata = as.h2o(test))

pred <- h2o.predict(my_rf, newdata = as.h2o(test))
pred <- as.vector(pred$predict)

cm <- table(test[,61], pred)
print(cm)

Answer 1

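Most likely, the function h2o.performance is using the F1 threshold to decide between yes and no. If you take the prediction results and build the table yourself, separating yes/no according to the model's "F1 threshold" value, you will see that the numbers almost match. I believe this is the main reason you see a difference between the results of h2o.performance and h2o.predict.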
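To make the thresholding point concrete, here is a minimal base R sketch. The labels and probabilities are toy values made up for illustration, `p1` stands in for the class-1 probability column of a prediction frame, and the two cut-offs (0.50 and 0.35) are arbitrary assumptions, not values taken from the asker's model.

```r
# Toy data (made up for illustration): actual labels and predicted P(class = 1).
actual <- c(0, 0, 0, 1, 0, 1, 1, 1)
p1     <- c(0.10, 0.30, 0.20, 0.40, 0.55, 0.60, 0.80, 0.90)

# Turn probabilities into 0/1 labels at a given threshold.
label_at <- function(p, threshold) as.integer(p >= threshold)

# Accuracy from a 2x2 confusion matrix: correct predictions sit on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Same probabilities, two different thresholds, two different confusion matrices.
cm_a <- table(actual = actual, predicted = label_at(p1, 0.50))
cm_b <- table(actual = actual, predicted = label_at(p1, 0.35))

print(cm_a); print(accuracy(cm_a))
print(cm_b); print(accuracy(cm_b))
```

The probabilities never change here; only the cut-off does, and that alone is enough to move the accuracy.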

Answer 2

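When predicting on new data that has no actual outcome to compare against (the 'y' parameter, in h2o terms), there is no Max F1 score or any other metric, so you have to rely on the predictions made by h2o.predict().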

Answer 3

The difference between performance() and predict() is explained below. This comes straight from H2O's help pages - http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#prediction

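Prediction threshold

For classification problems, when running h2o.predict() or .predict(), the prediction threshold is selected as follows:

If you train a model with only training data, the Max F1 threshold from the training data model metrics is used.
If you train a model with training and validation data, the Max F1 threshold from the validation data model metrics is used.
If you train a model with only training data and set the nfolds parameter, the Max F1 threshold from the training data model metrics is used.
If you train a model with training and validation data and also set the nfolds parameter, the Max F1 threshold from the validation data model metrics is used.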
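A rough base R sketch of why this rule matters for the question. It assumes, as Answer 1 suggests, that the confusion matrix reported for new data is taken at the Max F1 threshold of that new data, while the predicted labels use the Max F1 threshold of the training metrics. `max_f1_threshold` is a hypothetical helper written for this illustration, not an h2o function, and all scores are toy values.

```r
# Hypothetical helper: the candidate threshold that maximises F1 for given labels and scores.
max_f1_threshold <- function(actual, p) {
  candidates <- sort(unique(p))
  f1 <- sapply(candidates, function(t) {
    pred <- p >= t
    tp <- sum(pred & actual == 1)
    fp <- sum(pred & actual == 0)
    fn <- sum(!pred & actual == 1)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  candidates[which.max(f1)]
}

# Toy scores (made up for illustration) for a training set and a test set.
train_actual <- c(0, 0, 0, 1, 1, 1)
train_p      <- c(0.20, 0.30, 0.40, 0.70, 0.80, 0.90)
test_actual  <- c(0, 0, 1, 0, 1, 1)
test_p       <- c(0.10, 0.25, 0.35, 0.50, 0.60, 0.85)

t_train <- max_f1_threshold(train_actual, train_p)  # threshold the labels would use
t_test  <- max_f1_threshold(test_actual, test_p)    # threshold the test metrics would use

# Accuracy on the test set under each threshold.
acc_with_train_threshold <- mean(as.integer(test_p >= t_train) == test_actual)
acc_with_test_threshold  <- mean(as.integer(test_p >= t_test) == test_actual)

print(c(t_train = t_train, t_test = t_test))
print(c(train_threshold = acc_with_train_threshold, test_threshold = acc_with_test_threshold))
```

With real models, I believe h2o.find_threshold_by_max_metric(perf_rf, "f1") and the thresholds argument of h2o.confusionMatrix are the places to look for the actual values, but check the reference for your h2o version before relying on that.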
