How to interpret the probabilities (p0, p1) of the result of h2o.predict()
I am trying to understand the h2o.predict() function from the H2O R package. I noticed that in some cases, when the predict column is 1, the p1 column has a lower value than the p0 column. My understanding is that the p0 and p1 columns hold the probabilities of each event, so when predict = 1 I expected the p1 probability to be higher than that of the opposite event (p0). However, this is not always the case, as the following example on the prostate dataset shows.
Here is a reproducible example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)
prostate.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)
prostate.hex$RACE <- as.factor(prostate.hex$RACE)
prostate.hex$DCAPS <- as.factor(prostate.hex$DCAPS)
prostate.hex$DPROS <- as.factor(prostate.hex$DPROS)
prostate.hex.split = h2o.splitFrame(data = prostate.hex,
ratios = c(0.70, 0.20, 0.10), seed = 1234)
train.hex <- prostate.hex.split[[1]]
validate.hex <- prostate.hex.split[[2]]
test.hex <- prostate.hex.split[[3]]
fit <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = train.hex,
validation_frame = validate.hex,
family = "binomial", nfolds = 0, alpha = 0.5)
prostate.predict = h2o.predict(object = fit, newdata = test.hex)
result <- as.data.frame(prostate.predict)
subset(result, predict == 1 & p1 < 0.4)
I get the following output from the subset() call:
predict p0 p1
11 1 0.6355974 0.3644026
17 1 0.6153021 0.3846979
23 1 0.6289063 0.3710937
25 1 0.6007919 0.3992081
31 1 0.6239587 0.3760413
For all of the observations above from the test.hex dataset, the prediction is 1 even though p0 > p1.
The total number of observations with predict = 1 but p1 < p0 is:
> nrow(subset(result, predict == 1 & p1 < p0))
[1] 14
Conversely, there are no observations with predict = 0 where p0 < p1:
> nrow(subset(result, predict == 0 & p0 < p1))
[1] 0
Here is the frequency table for the predict column:
> table(result$predict)
0 1
18 23
We use CAPSULE, which has the following levels, as the response variable:
> levels(as.data.frame(prostate.hex)$CAPSULE)
[1] "0" "1"
Any suggestions?
Note: a question on a similar topic did not address this specific issue.
What you are describing is a threshold of 0.5. In fact, a different threshold will be used: one that maximizes a particular metric. The default metric is F1 (*); if you print the model information, you can find the threshold used for each metric.
See the related question for more information (your question is different, which is why I did not flag it as a duplicate).
As far as I know, you cannot change the F1 default for h2o.predict() or h2o.performance(). But you can use h2o.confusionMatrix() on your model fit and use max F2 instead:
h2o.confusionMatrix(fit, metrics = "f2")
You can also use the "p0" column from h2o.predict() directly with your own threshold, instead of the "predict" column. (That is what I used to do.)
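The "use your own threshold" suggestion can be sketched in plain R. This is a minimal, hypothetical example (the probabilities below are made up for illustration, not taken from the model): once the predictions are converted to a data frame, you can derive your own class labels from p1 with any cutoff you like.

```r
# Hypothetical predicted probabilities (illustrative values, not model output)
result <- data.frame(p0 = c(0.64, 0.30, 0.55),
                     p1 = c(0.36, 0.70, 0.45))

# Classify with your own cutoff instead of relying on the "predict" column
my_threshold <- 0.5
result$my_predict <- as.integer(result$p1 >= my_threshold)

result$my_predict
# [1] 0 1 0
```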
*: defined here: https://github.com/h2oai/h2o-3/blob/fdde85e41bad5f31b6b841b300ce23cfb2d8c0b0/h2o-core/src/main/java/hex/AUC2.java#L34. That file also shows how each metric is calculated.
It appears (see also here) that the threshold maximizing the F1 score on the validation dataset is used as the default classification threshold by h2o.glm(). We can observe the following:
- The threshold that maximizes the F1 score on the validation dataset is 0.363477.
- All data points whose predicted p1 probability is below this threshold are classified as class 0 (the highest p1 among the data points predicted as 0 is 0.3602365 < 0.363477).
- All data points whose predicted p1 probability is above this threshold are classified as class 1 (the lowest p1 among the data points predicted as class 1 is 0.3644026 > 0.363477).
min(result[result$predict==1,]$p1)
# [1] 0.3644026
max(result[result$predict==0,]$p1)
# [1] 0.3602365
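To confirm the mechanism, the "predict" column can be reproduced by hand from p1 and the validation F1 threshold. A small sketch, using the threshold 0.363477 reported below and a few p1 values copied from the result frame:

```r
# F1-maximizing threshold on the validation set (taken from the model output)
f1_threshold <- 0.363477

# A few p1 values copied from the result frame (rows 6, 11, 37, 1)
p1 <- c(0.3602365, 0.3644026, 0.1950625, 0.6354229)

# Manual classification: 1 if p1 exceeds the threshold, else 0
manual_predict <- as.integer(p1 > f1_threshold)
manual_predict
# [1] 0 1 0 1
```

This reproduces what h2o.predict() returns for those rows: the borderline value 0.3602365 falls below the threshold (class 0), while 0.3644026 falls just above it (class 1), even though both are below 0.5.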
# Thresholds found by maximizing the metrics on the training dataset
fit@model$training_metrics@metrics$max_criteria_and_metric_scores
#Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
#1 max f1 0.314699 0.641975 200
#2 max f2 0.215203 0.795148 262
#3 max f0point5 0.451965 0.669856 74
#4 max accuracy 0.451965 0.707581 74
#5 max precision 0.998285 1.000000 0
#6 max recall 0.215203 1.000000 262
#7 max specificity 0.998285 1.000000 0
#8 max absolute_mcc 0.451965 0.395147 74
#9 max min_per_class_accuracy 0.360174 0.652542 127
#10 max mean_per_class_accuracy 0.391279 0.683269 97
# Thresholds found by maximizing the metrics on the validation dataset
fit@model$validation_metrics@metrics$max_criteria_and_metric_scores
#Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
#1 max f1 0.363477 0.607143 33
#2 max f2 0.292342 0.785714 51
#3 max f0point5 0.643382 0.725806 9
#4 max accuracy 0.643382 0.774194 9
#5 max precision 0.985308 1.000000 0
#6 max recall 0.292342 1.000000 51
#7 max specificity 0.985308 1.000000 0
#8 max absolute_mcc 0.643382 0.499659 9
#9 max min_per_class_accuracy 0.379602 0.650000 28
#10 max mean_per_class_accuracy 0.618286 0.702273 11
result[order(result$predict),]
# predict p0 p1
#5 0 0.703274569 0.2967254
#6 0 0.639763460 0.3602365
#13 0 0.689557497 0.3104425
#14 0 0.656764541 0.3432355
#15 0 0.696248328 0.3037517
#16 0 0.707069611 0.2929304
#18 0 0.692137408 0.3078626
#19 0 0.701482762 0.2985172
#20 0 0.705973644 0.2940264
#21 0 0.701156961 0.2988430
#22 0 0.671778898 0.3282211
#24 0 0.646735016 0.3532650
#26 0 0.646582708 0.3534173
#27 0 0.690402957 0.3095970
#32 0 0.649945017 0.3500550
#37 0 0.804937468 0.1950625
#40 0 0.717706731 0.2822933
#41 0 0.642094040 0.3579060
#1 1 0.364577068 0.6354229
#2 1 0.503432724 0.4965673
#3 1 0.406771233 0.5932288
#4 1 0.551801718 0.4481983
#7 1 0.339600779 0.6603992
#8 1 0.002978593 0.9970214
#9 1 0.378034417 0.6219656
#10 1 0.596298925 0.4037011
#11 1 0.635597359 0.3644026
#12 1 0.552662241 0.4473378
#17 1 0.615302107 0.3846979
#23 1 0.628906297 0.3710937
#25 1 0.600791894 0.3992081
#28 1 0.216571552 0.7834284
#29 1 0.559174924 0.4408251
#30 1 0.489514642 0.5104854
#31 1 0.623958696 0.3760413
#33 1 0.504691497 0.4953085
#34 1 0.582509462 0.4174905
#35 1 0.504136056 0.4958639
#36 1 0.463076505 0.5369235
#38 1 0.510908093 0.4890919
#39 1 0.469376828 0.5306232