How to compute the binary log loss per sample of a scikit-learn ML model
I am trying to apply binary log loss to a Naive Bayes ML model I created. I generated a set of categorical predictions (yNew) and a set of probabilities (probabilityYes), but I cannot successfully run them through a log loss function.
The simple sklearn.metrics function gives a single log loss result, and I am not sure how to interpret it:
from sklearn.metrics import log_loss
ll = log_loss(yNew, probabilityYes, eps=1e-15)
print(ll)
.0819....
A more elaborate function returns a value of 2.55 for every NO and 2.50 for every YES (90 columns in total). Again, I don't know how to interpret this:
# assumes something like `import scipy as sp` (or numpy, which provides the same functions)
def logloss(yNew, probabilityYes):
    epsilon = 1e-15
    probabilityYes = sp.maximum(epsilon, probabilityYes)
    probabilityYes = sp.minimum(1 - epsilon, probabilityYes)
    # compute logloss function (vectorised)
    ll = sum(yNew * sp.log(probabilityYes) +
             sp.subtract(1, yNew) * sp.log(sp.subtract(1, probabilityYes)))
    ll = ll * -1.0 / len(yNew)
    return ll
print(logloss(yNew,probabilityYes))
2.55352047 2.55352047 2.50358354 2.55352047 2.50358354 2.55352047 .....
Here is how to compute the loss per sample:
import numpy as np

def logloss(true_label, predicted, eps=1e-15):
    # clip the predicted probability away from 0 and 1 to avoid log(0)
    p = np.clip(predicted, eps, 1 - eps)
    if true_label == 1:
        return -np.log(p)
    else:
        return -np.log(1 - p)
Let's check it with some dummy data (we don't actually need a model for this):
predictions = np.array([0.25, 0.65, 0.2, 0.51,
                        0.01, 0.1, 0.34, 0.97])
targets = np.array([1, 0, 0, 0,
                    0, 0, 0, 1])

ll = [logloss(x, y) for (x, y) in zip(targets, predictions)]
ll
# result:
[1.3862943611198906,
1.0498221244986778,
0.2231435513142097,
0.7133498878774648,
0.01005033585350145,
0.10536051565782628,
0.41551544396166595,
0.030459207484708574]
From the array above, you should be able to convince yourself that the further a prediction is from the corresponding true label, the greater the loss, as we would intuitively expect. For example, the first sample has true label 1 but a predicted probability of only 0.25, so its loss is -ln(0.25) ≈ 1.386, while the last sample (true label 1, prediction 0.97) has a loss of only -ln(0.97) ≈ 0.030.
Let's confirm that the calculation above agrees with the total (average) loss returned by scikit-learn:
from sklearn.metrics import log_loss
ll_sk = log_loss(targets, predictions)
ll_sk
# 0.4917494284709932
np.mean(ll)
# 0.4917494284709932
np.mean(ll) == ll_sk
# True
Adapted from code here [link is now dead].
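To tie this back to the original question: assuming your Naive Bayes model is a scikit-learn estimator with predict_proba (e.g. GaussianNB), probabilityYes should be the probability of the positive class only, i.e. the second column of predict_proba. A minimal sketch along those lines (the data and the names X, y, model are placeholders, not from the original post):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# placeholder data and model standing in for the asker's setup
X, y = make_classification(n_samples=90, random_state=0)
model = GaussianNB().fit(X, y)

# probability of the positive class (class 1) only
probabilityYes = model.predict_proba(X)[:, 1]

# per-sample losses, using the logloss() helper defined above
per_sample = [logloss(t, p) for t, p in zip(y, probabilityYes)]
print(per_sample[:5])       # first few individual losses
print(np.mean(per_sample))  # equals sklearn.metrics.log_loss(y, probabilityYes)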