Why are there only 16 rows in the H2O Gains/Lift table? How should they be interpreted?

The latest H2O documentation states: "The data is divided into groups by quantile thresholds of the response probability. Note that the default number of groups is 20; if there are fewer than 20 unique probability values, then the number of groups is reduced to the number of unique quantile thresholds." http://docs.h2o.ai/h2o/latest-stable/h2o-docs/flow.html#interpreting-the-gains-lift-chart

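In practice, however, only 16 rows are produced even when the input data contains more than 20 unique probability values, and it is unclear how those rows should be interpreted.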

You can see this even in the example code taken directly from the h2o.gainsLift() help page:

library(h2o)
h2o.init()
prosPath <- system.file("extdata", "prostate.csv", package="h2o")
hex <- h2o.uploadFile(prosPath)
hex[,2] <- as.factor(hex[,2])
model <- h2o.gbm(x = 3:9, y = 2, distribution = "bernoulli",
                 training_frame = hex, validation_frame = hex, nfolds = 3)
h2o.gainsLift(model)              ## extract training metrics. Note that there are only 16 rows in the Gains/Lift Table.
h2o.gainsLift(model, valid=TRUE)  ## extract validation metrics (here: the same)
h2o.gainsLift(model, xval=TRUE)   ## extract cross-validation metrics
h2o.gainsLift(model, newdata=hex) ## score on new data (here: the same)
# Generating a ModelMetrics object
perf <- h2o.performance(model, hex)
h2o.gainsLift(perf)               ## extract from existing metrics object. Note that there are still only 16 rows in the Gains/Lift Table.

# There are 380 unique predicted probability values, which is greater than 20. 
length(unique(as.data.frame(h2o.predict(model, hex))$p1))

Moreover, the gains/lift "sanity checks" shown on this page use unevenly spaced quantiles, so I am inclined to think these rows do not represent 16 evenly binned quantiles: https://github.com/h2oai/h2o-3/blob/master/h2o-r/tests/testdir_jira/runit_pubdev_2372_gainLift.R

See line 36 on that page, which I believe defines the bins. They are listed as the probabilities c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.96,0.97,0.98,0.99)

How should I interpret what the Gains/Lift table shows? Can I customize the n-tile bins that are displayed? Ideally, I would prefer to see 10 bins.

Thanks.

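The documentation should say 16 groups rather than 20 (the default was originally 20 groups, but it was later changed). I have created a Jira ticket for this issue that you can follow: https://0xdata.atlassian.net/browse/PUBDEV-5709?filter=-2.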

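You cannot change the quantiles without touching the Java code, but you can subset the table to the cumulative data fractions you are interested in (see the cumulative_data_fraction column); the Gains/Lift table gives you more information than you probably need.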
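
As a rough sketch of that subsetting, the helper below keeps only the rows whose cumulative_data_fraction is closest to 0.1, 0.2, ..., 1.0. The function name subset_gains_lift is hypothetical (it is not part of the h2o package), and the sketch assumes the object returned by h2o.gainsLift() can be treated as an ordinary data frame. Nearest matching is used rather than exact equality because ties in the predicted probabilities can shift the observed fractions slightly.

```r
# Hypothetical helper, not part of the h2o package: keep the rows of a
# Gains/Lift table whose cumulative_data_fraction is nearest to each
# requested fraction (deciles by default).
subset_gains_lift <- function(gl, fractions = seq(0.1, 1, by = 0.1)) {
  # Assumption: the Gains/Lift table converts cleanly to a data frame.
  gl <- as.data.frame(gl)
  # For each requested fraction, find the index of the closest row.
  idx <- vapply(
    fractions,
    function(f) which.min(abs(gl$cumulative_data_fraction - f)),
    integer(1)
  )
  # Drop repeated indices in case two fractions map to the same row.
  gl[unique(idx), , drop = FALSE]
}

# Usage with the model from the question (needs a running H2O cluster):
# gl <- h2o.gainsLift(model)
# subset_gains_lift(gl)
```

Note that in the retained rows only the cumulative columns (for example cumulative_lift and cumulative_capture_rate) can be read as decile summaries. The per-group columns such as lift and response_rate still describe the original, narrower groups; the row at a cumulative fraction of 0.1, for instance, covers only the data between 0.05 and 0.1.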