gridsearch.predict_proba results in a list rather than an array

Question

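I was following an example and trying to use grid search with a random forest classifier to compute roc_auc_score. However, y_prob = model.predict_proba(X_test) gives me a list (of two arrays) rather than a single array. What is going wrong here?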

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

X = np.random.rand(50,10)
y = np.random.permutation([1] * 25 + [0] * 25)

y= label_binarize(y, classes=[0, 1])
y= np.hstack((1-y, y))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)  
index_split = sss.split(X, y)
train_index = []
test_index = []
for train_ind, test_ind in index_split:
    train_index.extend(train_ind)
    test_index.extend(test_ind)

data_train = X[train_index]
out_train = y[train_index]
data_test = X[test_index]
out_test = y[test_index]

rf = RandomForestClassifier()
grids = {
     'n_estimators': [10, 50, 100, 200],   
     'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
     'criterion': ['gini', 'entropy']
        }
rf_grids_searched = GridSearchCV(rf, 
                                grids, 
                                scoring = "roc_auc",
                                n_jobs = -1,
                                refit=True,
                                cv = 5,
                                verbose=10)

rf_grids_searched.fit(data_train, out_train)
rf_best = rf_grids_searched.best_estimator_

y_prob=rf_best.predict_proba(data_test)
print(roc_auc_score(out_test, y_prob))

My result:

[array([[0.5, 0.5],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.4, 0.6]]), array([[0.5, 0.5],
    [0.5, 0.5],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.5, 0.5],
    [0.9, 0.1],
    [0.4, 0.6],
    [0.4, 0.6],
    [0.6, 0.4]])]

Expected result, with probabilities for classes [0, 1]:

    array([[0.5, 0.5],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.4, 0.6]])

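I also tried leaving y unbinarized and fitting the grid search, which gave the y_prob array below. Afterwards I binarized y_test to match the dimensions of y_prob and computed the score. Is this the right order of steps? Code: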

  out_test1= label_binarize(out_test, classes=[0, 1])
  out_test1= np.hstack((1-out_test1, out_test1))
  print(roc_auc_score(out_test1, y_prob))   

   array([[0.6, 0.4],
   [0.5, 0.5],
   [0.6, 0.4],
   [0.5, 0.5],
   [0.7, 0.3],
   [0.3, 0.7],
   [0.8, 0.2],
   [0.4, 0.6],
   [0.8, 0.2],
   [0.4, 0.6]])

Answer 1

The grid search's predict_proba method simply dispatches to the best estimator's predict_proba. From the docstring of RandomForestClassifier.predict_proba (emphasis added):

Returns

p : ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs > 1. ...

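Because you passed two outputs (the two columns of y), you get the predicted probabilities of each of the two classes for each of the two targets, that is, one array per target.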
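To make the difference concrete, here is a minimal sketch (not the asker's exact setup) that fits once on a two-column y and once on a 1-D y. The small grid, the seeds, and the use of train_test_split instead of StratifiedShuffleSplit are assumptions chosen only to keep the example short. With a 1-D target, predict_proba returns a single (n_samples, n_classes) array whose columns follow classes_, so the positive-class column can be passed to roc_auc_score without binarizing anything.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(7)
X = rng.rand(50, 10)
y = rng.permutation([1] * 25 + [0] * 25)  # 1-D labels, left unbinarized

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

# Two-column target, as in the question: scikit-learn treats each column
# as a separate output, so predict_proba returns one array per output.
y_train_2col = np.column_stack((1 - y_train, y_train))
rf_multi = RandomForestClassifier(n_estimators=10, random_state=0)
rf_multi.fit(X_train, y_train_2col)
proba_multi = rf_multi.predict_proba(X_test)
print(type(proba_multi), len(proba_multi))  # a list with two entries

# 1-D target: a single array of shape (n_samples, n_classes).
# The grid here is deliberately tiny (an assumption for brevity).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
proba_1d = search.predict_proba(X_test)
print(proba_1d.shape)

# Columns of predict_proba are ordered like classes_, so column 1 holds
# the probability of class 1 here.
classes = search.best_estimator_.classes_
print(classes)

# For a binary problem, score with the positive-class column only.
auc = roc_auc_score(y_test, proba_1d[:, 1])
print(auc)
```

Regarding the second attempt in the question: binarizing y_test afterwards and scoring against both columns should also run, but for a binary target the single positive-class column carries the same information and is the simpler choice.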
