random forest - "perfect" confusion matrix
I have a classification problem in which I want to identify which prospective borrowers should not be invited to a meeting at the bank.
In the data, ca. 25% of the borrowers should not be invited.
I have about 4500 observations and 86 features (many of them dummies).
After cleaning the data, I do:
# Separate X_train and Y_train
X = ratings_prepared[:, :-1]
y = ratings_prepared[:, -1]
##################################################################################
# Separate test and train (stratified, 20% test)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skfolds.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
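(Side note: a more direct way to get a single stratified 80/20 split would be train_test_split; the following is only a sketch for comparison, not the code I actually ran.)
from sklearn.model_selection import train_test_split

# Sketch: one stratified 80/20 split in a single call
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)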
Then I move on to training the models. The SGD classifier does not perform well:
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0, 1])
############################# Train Models #############################
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, confusion_matrix

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
y_pred = sgd_clf.predict(X_train)

# f1 score
f1_score(y_train, y_pred)

# confusion matrix
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)

from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
disp = plot_confusion_matrix(sgd_clf, X_train, y_train,
                             cmap=plt.cm.Blues,
                             normalize='true')
# recall and precision
from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)
### Precision score: 0.5084427767354597

# Precision Recall
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, roc_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

# Plot ROC curve
fpr, tpr, roc_thresholds = roc_curve(y_train, y_scores)
plot_roc_curve(fpr, tpr)
plt.show()
Then I turn to a Random Forest classifier, which is supposed to improve on the SGD:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train, y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
Indeed, the ROC curve looks better:
But the confusion matrix and the precision score are very strange:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train, y_scores_forest)

forest_clf.fit(X_train, y_train)
y_pred = forest_clf.predict(X_train)

# f1 score
f1_score(y_train, y_pred)

# confusion matrix
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
disp = plot_confusion_matrix(forest_clf, X_train, y_train,
                             cmap=plt.cm.Blues,
                             normalize='true')
The F1 score is also 1, and I do not understand what is going on here. I suspect I have made a mistake somewhere, but the fact that the SGD classifier seems to work fine makes me think it is not related to the data cleaning.
Any idea what could be wrong?
# Update:
1) Absolute values of the confusion matrix:
2) Lowering the threshold:
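For context, a lowered threshold can be applied to the cross-validated probabilities roughly like this (only a sketch; the 0.3 cutoff is an illustrative value, not necessarily the one used for the screenshot):

# Sketch: re-label the cross-validated probabilities with a lower cutoff
# than the default 0.5 (0.3 is only an example value)
threshold = 0.3
y_pred_lowered = (y_scores_forest >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_train, y_pred_lowered).ravel()
(tn, fp, fn, tp)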
The reason you are getting perfect scores is that you are not computing the metrics on the test data.
In the first block you make an 80/20 split into training and test data, but afterwards all the metrics (ROC, confusion matrix, etc.) are computed on the original training data rather than on the test data.
With a setup like that, your report simply shows that you are wildly overfitting.
What you should do is apply the trained model to your test data and then look at how the model performs there.
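For illustration, a minimal sketch of that evaluation, reusing the variable names from your question (forest_clf, X_test, y_test); none of the resulting numbers come from the original post:

# Sketch: fit on the training data only, then score on the held-out test set
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

forest_clf.fit(X_train, y_train)
y_test_pred = forest_clf.predict(X_test)

print("F1:       ", f1_score(y_test, y_test_pred))
print("Precision:", precision_score(y_test, y_test_pred))
print("Recall:   ", recall_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))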