Plotting a ROC curve with imblearn

Question

I am trying to plot a ROC curve using imblearn, but I ran into some problems when running my script.

This is a screenshot of my data.

from imblearn.over_sampling import SMOTE, ADASYN
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
# Import some data to play with
# Raw string so that "\a" in the Windows path is not read as an escape sequence
df = pd.read_csv(r"E:\autodesk\Hourly and weather ml.csv")
# X and y are different columns of the input data. Input X as numpy array
X = df[['TTI','Max TemperatureF','Mean TemperatureF','Min TemperatureF',' Min Humidity']].values
# # Reshape X. Do this if X has only one value per data point. In this case, TTI.

# # Input y as normal list
y = df['TTI_Category'].as_matrix()

X_resampled, y_resampled = SMOTE().fit_sample(X, y)

y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score=classifier.fit(X_resampled, y_resampled).predict_proba(X_test)

# Compute ROC curve and ROC area for each class

fpr = dict()
tpr = dict()

roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())

roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

plt.figure()

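I changed the original X_train and y_train to X_resampled and y_resampled, because training should be done on the resampled dataset while testing should be done on the original test dataset. However, I get the following traceback: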

runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')
Traceback (most recent call last):

  File "<ipython-input-128-efb16ffc92ca>", line 1, in <module>
    runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')

  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "E:/autodesk/SMOTE with multiclass.py", line 51, in <module>
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])

IndexError: too many indices for array

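I added another line to binarize both y_resampled and the original y, keeping everything else the same, but I am not sure whether I am actually fitting on the resampled data and testing on the original data: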

X_resampled, y_resampled = SMOTE().fit_sample(X, y)

y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])

y = label_binarize(y, classes=['Good','Bad','Ok'])
n_classes = y.shape[1]

Any help is much appreciated.

Answer 1

First, let's discuss the error. You are doing this:

y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]

So your n_classes is actually 3.

Later on, you do this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                random_state=0)

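Here you are using the original y, not y_resampled. So y_test is currently a 1-D array of shape (n_samples,), or possibly a column vector of shape (n_samples, 1).

In the for loop you iterate i over range(n_classes), i.e. 0, 1 and 2, and index y_test[:, i]. That is not possible for this y_test: the index you are trying to access does not exist, hence the IndexError.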
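To make the failure concrete, here is a minimal sketch on made-up labels (the four-element array is an assumption for illustration, not your data). Indexing a 1-D label array with two indices raises the same kind of IndexError, while the output of label_binarize has one column per class and can be sliced with [:, i].

```python
import numpy as np
from sklearn.preprocessing import label_binarize

classes = ['Good', 'Bad', 'Ok']

# Raw string labels, like a single DataFrame column: shape (n_samples,)
y_test = np.array(['Good', 'Bad', 'Ok', 'Good'])

try:
    y_test[:, 0]  # two indices on a 1-D array
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# After binarizing there is one 0/1 column per class, so y[:, i] is valid
y_test_bin = label_binarize(y_test, classes=classes)

print("raw labels shape:      ", y_test.shape)
print("binarized labels shape:", y_test_bin.shape)
print("IndexError message:    ", error_message)
```

The binarized array has shape (4, 3), with its columns in the order given in classes.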

Second, you should split the data into train and test sets first, and then resample only the training part.
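To illustrate that ordering, here is a rough sketch on synthetic data. Everything in it is an assumption made for illustration: the data comes from make_classification rather than your CSV, and plain random oversampling via sklearn.utils.resample stands in for SMOTE so that the sketch needs only scikit-learn. The point is just that the test split is created before any resampling and is never touched afterwards.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced 3-class data (hypothetical stand-in for the CSV)
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# 1. Split first, so the test set keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# 2. Oversample only the training part (random oversampling, not SMOTE):
#    keep every original row and add random copies of the smaller classes
target = max(Counter(y_train.tolist()).values())
X_parts, y_parts = [], []
for label in np.unique(y_train):
    X_cls = X_train[y_train == label]
    extra = target - len(X_cls)
    if extra > 0:
        X_extra = resample(X_cls, replace=True, n_samples=extra,
                           random_state=0)
        X_cls = np.vstack([X_cls, X_extra])
    X_parts.append(X_cls)
    y_parts.append(np.full(len(X_cls), label))
X_resampled = np.vstack(X_parts)
y_resampled = np.concatenate(y_parts)

print("train before:", Counter(y_train.tolist()))
print("train after: ", Counter(y_resampled.tolist()))
print("test:        ", Counter(y_test.tolist()))
```

With imblearn, step 2 would be a single resampler call on X_train and y_train; the test arrays stay exactly as train_test_split returned them.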

For your script, the following code should do what you want:

# First divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Then resample only the training data
# (fit_resample replaces fit_sample, which newer imblearn releases no longer provide)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

# Then binarize the labels so they can be used for the multi-class ROC
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])

# Do the same to the test labels
y_test = label_binarize(y_test, classes=['Good','Bad','Ok'])
n_classes = y_test.shape[1]

# Fit on the resampled training data, score the untouched test data
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_resampled, y_resampled).predict_proba(X_test)

# Then you can do this and the other parts of your code
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
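For the remaining steps (micro-average and the actual plot), here is a hedged end-to-end sketch. It runs on synthetic data from make_classification, uses integer class labels 0, 1, 2 instead of 'Good', 'Bad', 'Ok', limits the tree depth, and leaves out the resampling step so that it depends only on scikit-learn; all of those are assumptions for illustration. In your script, X_train and y_train_bin would be the resampled arrays. The helper plot_roc and the file name roc.png are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (stand-in for the real features and labels)
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Binarize train and test labels with the same class order
classes = [0, 1, 2]
y_train_bin = label_binarize(y_train, classes=classes)
y_test_bin = label_binarize(y_test, classes=classes)
n_classes = y_test_bin.shape[1]

classifier = OneVsRestClassifier(
    DecisionTreeClassifier(max_depth=3, random_state=0))
y_score = classifier.fit(X_train, y_train_bin).predict_proba(X_test)

# Per-class ROC curves and areas
fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Micro-average: pool every (sample, class) decision into one binary problem
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])


def plot_roc(fpr, tpr, roc_auc, path="roc.png"):
    """Draw one ROC curve per key and save the figure to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, writes to a file
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for key in fpr:
        ax.plot(fpr[key], tpr[key],
                label="class {} (AUC = {:.2f})".format(key, roc_auc[key]))
    ax.plot([0, 1], [0, 1], "k--", label="chance")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    fig.savefig(path)
    plt.close(fig)
```

Calling plot_roc(fpr, tpr, roc_auc) writes the figure to roc.png; in an interactive session you could drop the Agg line and call plt.show() instead.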
