Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
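My problem:

I have a dataset which is a large JSON file. I read it and store it in the trainList variable.

Next, I pre-process it so that I can work with it.

Once that is done, I start the classification:

- I use the kfold cross-validation method to obtain the mean accuracy and train a classifier.
- I make the predictions and obtain the accuracy and confusion matrix of that fold.
- After this, I would like to obtain the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values. I will use these values to compute the Sensitivity and Specificity.

Finally, I would put this into HTML to show a chart with the TPs of each label.

Code:

The variables I have for the moment:

    trainList  # a list with all the data of my dataset in JSON form
    labelList  # a list with all the labels of my data

Most of the method: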
    # Transform the data from JSON form to a numerical one
    X = vec.fit_transform(trainList)

    # Scale the matrix (I don't know why, but without this step I get an error)
    X = preprocessing.scale(X.toarray())

    # Generate a KFold in order to do cross-validation
    kf = KFold(len(X), n_folds=10, indices=True, shuffle=True, random_state=1)

    # Start the cross-validation
    for train_indices, test_indices in kf:
        X_train = [X[ii] for ii in train_indices]
        X_test = [X[ii] for ii in test_indices]
        y_train = [labelList[ii] for ii in train_indices]
        y_test = [labelList[ii] for ii in test_indices]

        # Train the classifier
        trained = qda.fit(X_train, y_train)

        # Make the predictions
        predicted = qda.predict(X_test)

        # Obtain the accuracy of this fold
        ac = accuracy_score(y_test, predicted)

        # Obtain the confusion matrix
        cm = confusion_matrix(y_test, predicted)

        # I should calculate the TP, TN, FP and FN here
        # I don't know how to continue
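For readers on a recent scikit-learn, here is a minimal sketch of the same loop that also extracts the four counts per fold. It assumes a version where KFold lives in sklearn.model_selection and takes n_splits, and it assumes binary labels 0 and 1. The synthetic data from make_classification and the helper name per_fold_counts are stand-ins invented for this sketch, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold


def per_fold_counts(X, y, n_splits=5, seed=1):
    """Return a list of (tn, fp, fn, tp) tuples, one per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    counts = []
    for train_idx, test_idx in kf.split(X):
        clf = QuadraticDiscriminantAnalysis()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        # labels=[0, 1] keeps the matrix 2x2 even if a fold lacks a class
        cm = confusion_matrix(y[test_idx], predicted, labels=[0, 1])
        tn, fp, fn, tp = cm.ravel()
        counts.append((int(tn), int(fp), int(fn), int(tp)))
    return counts


# Synthetic stand-in for the vectorised JSON data (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
fold_counts = per_fold_counts(X, y)
for fold, (tn, fp, fn, tp) in enumerate(fold_counts):
    print(f"fold {fold}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

Every test sample falls into exactly one of the four cells, so the counts of a fold add up to the size of that fold.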
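You can obtain all of the parameters from the confusion matrix.
The structure of the confusion matrix (which is a 2x2 matrix) is as follows, assuming the first index corresponds to the positive label and the rows correspond to the true labels:

    TP|FN
    FP|TN

So

    TP = cm[0][0]
    FN = cm[0][1]
    FP = cm[1][0]
    TN = cm[1][1]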
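If you have two lists that hold the predicted and actual values, as it appears you do, you can pass them to a function that calculates TP, FP, TN and FN with something like this:

    def perf_measure(y_actual, y_hat):
        TP = 0
        FP = 0
        TN = 0
        FN = 0

        for i in range(len(y_hat)):
            if y_actual[i] == y_hat[i] == 1:
                TP += 1
            if y_hat[i] == 1 and y_actual[i] != y_hat[i]:
                FP += 1
            if y_actual[i] == y_hat[i] == 0:
                TN += 1
            if y_hat[i] == 0 and y_actual[i] != y_hat[i]:
                FN += 1

        return (TP, FP, TN, FN)

From here I think you will be able to calculate the rates of interest to you, and other performance measures such as specificity and sensitivity.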
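As a rough sketch of that last step, the two rates follow directly from the four counts. The function name sensitivity_specificity is made up for illustration, and returning None when a rate is undefined is just one possible convention.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity); a rate is None when undefined."""
    positives = tp + fn  # all samples whose true label is positive
    negatives = tn + fp  # all samples whose true label is negative
    sensitivity = tp / positives if positives else None
    specificity = tn / negatives if negatives else None
    return sensitivity, specificity


sens, spec = sensitivity_specificity(tp=3, fp=3, tn=4, fn=1)
print(sens, spec)
```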
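I think both of the answers are not fully correct. For example, suppose that we have the following arrays:

    y_actual = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
    y_predic = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

If we compute the FP, FN, TP and TN values manually, they should be as follows:

    FP: 3
    FN: 1
    TP: 3
    TN: 4

However, if we use the first answer, the results are as follows:

    FP: 1
    FN: 3
    TP: 3
    TN: 4

They are not correct, because in the first answer a False Positive should be a case where the actual value is 0 but the predicted value is 1, not the opposite. The same applies to False Negative.

And if we use the second answer, the results are computed as follows:

    FP: 3
    FN: 1
    TP: 4
    TN: 3

The True Positive and True Negative numbers are not correct; they should be the other way around.

Am I correct with my computations? Please let me know if I am missing something.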
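The manual count can be double-checked without any library by tallying the (actual, predicted) pairs. This is only a sanity-check sketch; it assumes 1 is the positive label and 0 the negative label, as in the arrays above.

```python
from collections import Counter

y_actual = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
y_predic = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# Count each (actual, predicted) pair once
pairs = Counter(zip(y_actual, y_predic))

tp = pairs[(1, 1)]  # actual 1, predicted 1
tn = pairs[(0, 0)]  # actual 0, predicted 0
fp = pairs[(0, 1)]  # actual 0, predicted 1
fn = pairs[(1, 0)]  # actual 1, predicted 0

print(f"FP: {fp}  FN: {fn}  TP: {tp}  TN: {tn}")  # FP: 3  FN: 1  TP: 3  TN: 4
```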
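According to the scikit-learn documentation:

By definition a confusion matrix C is such that C[i, j] is equal to the number of observations known to be in group i but predicted to be in group j.

Thus in binary classification, the count of true negatives is C[0,0], false negatives is C[1,0], true positives is C[1,1] and false positives is C[0,1].

    from sklearn.metrics import confusion_matrix

    CM = confusion_matrix(y_true, y_pred)

    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]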
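To make the C[i, j] definition concrete, here is a small pure-Python sketch that builds the matrix by hand. The helper build_confusion_matrix is hypothetical and assumes the labels are the integers 0 to n_classes - 1; it is not how scikit-learn implements it.

```python
def build_confusion_matrix(y_true, y_pred, n_classes=2):
    """C[i][j] = number of samples with true label i and predicted label j."""
    C = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        C[t][p] += 1
    return C


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
C = build_confusion_matrix(y_true, y_pred)
print(C)  # [[1, 1], [1, 2]]

# Row 0 holds the true negatives' row, row 1 the true positives' row
TN, FP = C[0]
FN, TP = C[1]
```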
If your classifier has multiple classes, you might want to use pandas-ml for that part. The confusion matrix of pandas-ml gives more detailed information.
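For the multi-class case, everything you need can be found from the confusion matrix. For example, if your confusion matrix looks like this:

[figure: example multi-class confusion matrix]

then what you are looking for, per class, can be found like this:

[figure: per-class TP, FP, FN and TN regions of the confusion matrix]

Using pandas/numpy, you can do this for all classes at once like so:

    # confusion_matrix is a pandas DataFrame here
    # (rows: true class, columns: predicted class)
    FP = confusion_matrix.sum(axis=0) - np.diag(confusion_matrix)
    FN = confusion_matrix.sum(axis=1) - np.diag(confusion_matrix)
    TP = np.diag(confusion_matrix)
    TN = confusion_matrix.values.sum() - (FP + FN + TP)

    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP)
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)

    # Overall accuracy
    ACC = (TP+TN)/(TP+FP+FN+TN)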
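Here is a self-contained sketch of the same idea on a plain NumPy array. The 3-class matrix is a made-up example, and safe_rate is a hypothetical helper that returns 0.0 where a denominator is zero (a convention chosen for this sketch; NaN would be equally defensible).

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true classes, columns are predictions
cm = np.array([[2, 0, 0],
               [0, 0, 1],
               [1, 0, 2]])

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP   # predicted as the class, but actually another one
FN = cm.sum(axis=1) - TP   # actually the class, but predicted as another one
TN = cm.sum() - (TP + FP + FN)


def safe_rate(num, den):
    """Element-wise num / den, returning 0.0 where the denominator is 0."""
    num = num.astype(float)
    den = den.astype(float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


sensitivity = safe_rate(TP, TP + FN)   # per-class recall
specificity = safe_rate(TN, TN + FP)   # per-class true negative rate
print(TP, FP, FN, TN)
print(sensitivity, specificity)
```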
In the scikit-learn 'metrics' library there is a confusion_matrix method which gives you the desired output.

You can use any classifier that you want. Here I use KNeighbors as an example.

    from sklearn import metrics, neighbors

    # clf must be fitted on the training data before calling predict
    clf = neighbors.KNeighborsClassifier()

    X_test = ...
    y_test = ...

    expected = y_test
    predicted = clf.predict(X_test)

    conf_matrix = metrics.confusion_matrix(expected, predicted)

    >>> print(conf_matrix)
    [[1403   87]
     [  56 3159]]
Here's a fix to the buggy code of invoketheshell (which currently appears as the accepted answer):

    def performance_measure(y_actual, y_hat):
        TP = 0
        FP = 0
        TN = 0
        FN = 0

        for i in range(len(y_hat)):
            if y_actual[i] == y_hat[i] == 1:
                TP += 1
            if y_hat[i] == 1 and y_actual[i] == 0:
                FP += 1
            if y_hat[i] == y_actual[i] == 0:
                TN += 1
            if y_hat[i] == 0 and y_actual[i] == 1:
                FN += 1

        return (TP, FP, TN, FN)
I wrote a version that works using only numpy.
I hope it helps you.

    import numpy as np

    def perf_metrics_2X2(yobs, yhat):
        """
        Returns the sensitivity, specificity, positive predictive value, and
        negative predictive value of a 2X2 table.

        where:
        0 = negative case
        1 = positive case

        Parameters
        ----------
        yobs : array of positive and negative ``observed`` cases
        yhat : array of positive and negative ``predicted`` cases

        Returns
        -------
        sensitivity  = TP / (TP+FN)
        specificity  = TN / (TN+FP)
        pos_pred_val = TP / (TP+FP)
        neg_pred_val = TN / (TN+FN)

        Author: Julio Cardenas-Rodriguez
        """
        yobs = np.asarray(yobs)
        yhat = np.asarray(yhat)
        # Combine element-wise masks so both arrays keep the same length
        TP = np.sum((yobs == 1) & (yhat == 1))
        TN = np.sum((yobs == 0) & (yhat == 0))
        FP = np.sum((yobs == 0) & (yhat == 1))
        FN = np.sum((yobs == 1) & (yhat == 0))

        sensitivity = TP / (TP + FN)
        specificity = TN / (TN + FP)
        pos_pred_val = TP / (TP + FP)
        neg_pred_val = TN / (TN + FN)

        return sensitivity, specificity, pos_pred_val, neg_pred_val
You can try sklearn.metrics.classification_report as below:

    from sklearn.metrics import classification_report

    y_true = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

    print(classification_report(y_true, y_pred))

Output:

                 precision    recall  f1-score   support

              0       0.80      0.57      0.67         7
              1       0.50      0.75      0.60         4

    avg / total       0.69      0.64      0.64        11
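One way to obtain true positives etc. out of the confusion matrix is to ravel it:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(tn, fp, fn, tp)  # 1 1 1 1

One should set the labels parameter in case the data contains only a single case, e.g. only true positives. Setting labels correctly ensures that the confusion matrix has a 2x2 shape.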
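A short sketch of the single-class situation described above, assuming binary labels 0 and 1. Without labels the matrix collapses to 1x1, so unpacking four values from ravel() would fail; with labels=[0, 1] the shape stays 2x2.

```python
from sklearn.metrics import confusion_matrix

# Every sample is a correctly predicted positive
y_true = [1, 1]
y_pred = [1, 1]

# Without labels, only class 1 is seen, so the matrix is 1x1
# (recent scikit-learn versions may also emit a warning here)
cm_single = confusion_matrix(y_true, y_pred)
print(cm_single.shape)  # (1, 1)

# With labels, the matrix is always 2x2 and ravel() unpacks cleanly
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 0 0 0 2
```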
I have tried some of the answers and found that they did not work.
This works for me:

    from sklearn.metrics import classification_report

    print(classification_report(y_test, predicted))
Just in case someone is looking for the same thing in a MULTI-CLASS example:

    def perf_measure(y_actual, y_pred):
        # Sorted so that the order of the returned lists is deterministic
        class_id = sorted(set(y_actual).union(set(y_pred)))
        TP = []
        FP = []
        TN = []
        FN = []

        for index, _id in enumerate(class_id):
            TP.append(0)
            FP.append(0)
            TN.append(0)
            FN.append(0)
            # Treat _id as the positive class and every other class as negative
            for i in range(len(y_pred)):
                if y_actual[i] == _id and y_pred[i] == _id:
                    TP[index] += 1
                if y_actual[i] != _id and y_pred[i] == _id:
                    FP[index] += 1
                if y_actual[i] != _id and y_pred[i] != _id:
                    TN[index] += 1
                if y_actual[i] == _id and y_pred[i] != _id:
                    FN[index] += 1

        return class_id, TP, FP, TN, FN
In scikit-learn 0.22 you can do it like this:

    from sklearn.metrics import multilabel_confusion_matrix

    y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
    y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

    mcm = multilabel_confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

    tn = mcm[:, 0, 0]
    tp = mcm[:, 1, 1]
    fn = mcm[:, 1, 0]
    fp = mcm[:, 0, 1]
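To see what those slices contain, the sketch below prints the counts label by label for the same data. It relies on multilabel_confusion_matrix returning one 2x2 matrix per label laid out as [[tn, fp], [fn, tp]]; the loop and variable names are only illustrative.

```python
from sklearn.metrics import multilabel_confusion_matrix

labels = ["ant", "bird", "cat"]
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

# One 2x2 matrix per label, each laid out as [[tn, fp], [fn, tp]]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=labels)

for label, ((tn, fp), (fn, tp)) in zip(labels, mcm):
    print(f"{label}: TP={tp} FP={fp} FN={fn} TN={tn}")
# ant: TP=2 FP=1 FN=0 TN=3
# bird: TP=0 FP=0 FN=1 TN=5
# cat: TP=2 FP=1 FN=1 TN=2
```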
    import numpy as np
    import pandas as pd

    # False positive cases
    train = pd.merge(X_train, y_train, left_index=True, right_index=True)
    y_train_pred = pd.DataFrame(y_train_pred)
    y_train_pred.rename(columns={0: 'Predicted'}, inplace=True)
    train = train.reset_index(drop=True).merge(y_train_pred.reset_index(drop=True),
                                               left_index=True, right_index=True)
    train['FP'] = np.where((train['Banknote'] == "Forged") & (train['Predicted'] == "Genuine"), 1, 0)
    train[train.FP != 0]

    # False negative cases
    test = pd.merge(Variables_test, Banknote_test, left_index=True, right_index=True)
    Banknote_test_pred = pd.DataFrame(banknote_test_pred)
    Banknote_test_pred.rename(columns={0: 'Predicted'}, inplace=True)
    test = test.reset_index(drop=True).merge(Banknote_test_pred.reset_index(drop=True),
                                             left_index=True, right_index=True)
    test['FN'] = np.where((test['Banknote'] == "Genuine") & (test['Predicted'] == "Forged"), 1, 0)
    test[test.FN != 0]
    def getTPFPTNFN(y_true, y_pred):
        TP, FP, TN, FN = 0, 0, 0, 0
        for s_true, s_pred in zip(y_true, y_pred):
            if s_true == 1:
                if s_pred == 1:
                    TP += 1
                else:
                    FN += 1
            else:
                if s_pred == 0:
                    TN += 1
                else:
                    FP += 1
        return TP, FP, TN, FN
None of the answers given so far worked for me, as I sometimes ended up with a confusion matrix containing only a single entry. The following code mitigates this issue:

    from sklearn.metrics import confusion_matrix

    CM = confusion_matrix(y, y_hat)

    try:
        TN = CM[0][0]
    except IndexError:
        TN = 0

    try:
        FN = CM[1][0]
    except IndexError:
        FN = 0

    try:
        TP = CM[1][1]
    except IndexError:
        TP = 0

    try:
        FP = CM[0][1]
    except IndexError:
        FP = 0

Please note that "y" is the ground truth and "y_hat" is the prediction.
This works fine.
Source - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

    tn, fp, fn, tp = confusion_matrix(y_test, predicted).ravel()
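Although it is not related to scikit-learn, you could also do:

    # y_test and pred are assumed to be NumPy integer arrays of 0s and 1s
    tp = sum(y_test & pred)
    fp = sum((1 - y_test) & pred)
    tn = sum((1 - y_test) & (1 - pred))
    fn = sum(y_test & (1 - pred))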
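A variant of the same trick with boolean masks, as a sketch. It assumes both inputs are NumPy arrays of 0s and 1s; the example arrays are reused from an earlier answer in this thread, and the names actual_pos and pred_pos are made up for illustration.

```python
import numpy as np

y_test = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0])
pred = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0])

# Boolean masks make the intent explicit and avoid operator-precedence surprises
actual_pos = y_test == 1
pred_pos = pred == 1

tp = int(np.sum(actual_pos & pred_pos))
fp = int(np.sum(~actual_pos & pred_pos))
tn = int(np.sum(~actual_pos & ~pred_pos))
fn = int(np.sum(actual_pos & ~pred_pos))

print(tp, fp, tn, fn)  # 3 3 4 1
```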