Random Forest is overfitting

I'm using scikit-learn with stratified CV to compare some classifiers. I'm computing: accuracy, recall, and AUC.

I use GridSearchCV with 5-fold CV for the parameter optimization.

RandomForestClassifier(warm_start=True, min_samples_leaf=1, n_estimators=800, min_samples_split=5, max_features='log2', max_depth=400, class_weight=None)

are the best_params from GridSearchCV.
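For context, a rough sketch of how I set up the search; the grid values below are just illustrative, not my exact grid:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# illustrative grid only; the real search covered more parameters/values
param_grid = {'n_estimators': [400, 800],
              'max_features': ['sqrt', 'log2'],
              'max_depth': [200, 400],
              'min_samples_split': [2, 5]}

gs = GridSearchCV(RandomForestClassifier(warm_start=True),
                  param_grid=param_grid, scoring='accuracy', cv=5)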

My problem is that I think I'm really overfitting. For example:

Random Forest with standard deviation (+/-)

  • precision: 0.99 (+/- 0.06)
  • sensitivity: 0.94 (+/- 0.06)
  • specificity: 0.94 (+/- 0.06)
  • B_accuracy: 0.94 (+/- 0.06)
  • AUC: 0.94 (+/- 0.11)

Logistic Regression with standard deviation (+/-)

  • precision: 0.88 (+/- 0.06)
  • sensitivity: 0.79 (+/- 0.06)
  • specificity: 0.68 (+/- 0.06)
  • B_accuracy: 0.73 (+/- 0.06)
  • AUC: 0.73 (+/- 0.041)

The other classifiers look similar to Logistic Regression (so they don't appear to be overfitting).

My CV code is:

import math
import numpy as np

# split the (features, label) pairs into X and y
X, y = [], []
for features, label in data:
    X.append(features)
    y.append(float(label))
x = np.array(X)
y = np.array(y)

def SD(values):
    # population mean and standard deviation of a list of scores
    # (equivalent to np.mean(values) and np.std(values))
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    sd = math.sqrt(sum(squared_diffs) / len(values))
    return sd, mean

# skf is assumed to be defined earlier, e.g.
# skf = cross_validation.StratifiedKFold(y, n_folds=10)
for name, clf in zip(titles, classifiers):
    # go through all classifiers, compute 10 folds
    pre, sen, spe, ba, area = [], [], [], [], []
    for train_index, test_index in skf:
        # index the data arrays with the fold indices
        X_train, X_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]

        #clf = clf.fit(X_train, y_train)
        #predicted = clf.predict_proba(X_test)
        #... other code, calculating metrics and so on...

    print(name)
    print("precision: %0.2f \t(+/- %0.2f)" % (SD(pre)[1], SD(pre)[0]))
    print("sensitivity: %0.2f \t(+/- %0.2f)" % (SD(sen)[1], SD(sen)[0]))
    print("specificity: %0.2f \t(+/- %0.2f)" % (SD(spe)[1], SD(spe)[0]))
    print("B_accuracy: %0.2f \t(+/- %0.2f)" % (SD(ba)[1], SD(ba)[0]))
    print("AUC: %0.2f \t(+/- %0.2f)" % (SD(area)[1], SD(area)[0]))
    print("\n")

If I use the scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy') approach, I don't get these "overfitting" values. So maybe there is something wrong with the CV method I'm using? But it only happens with RF...

Because scikit-learn's cross-validation scoring lacks a specificity metric, I implemented it myself.
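For reference, a minimal sketch of how such a specificity scorer could look (specificity_score is a hypothetical helper name; it assumes binary 0/1 labels):

from sklearn.metrics import confusion_matrix, make_scorer

def specificity_score(y_true, y_pred):
    # specificity = TN / (TN + FP), taken from the binary confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn / float(tn + fp)

# e.g.: cross_val_score(clf, x, y, cv=10, scoring=make_scorer(specificity_score))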

Herbert,

If your goal is to compare different learning algorithms, I recommend using nested cross-validation. (By learning algorithms I mean the different algorithms, e.g., logistic regression, decision trees, and other discriminative models that learn a hypothesis or model, your final classifier, from your training data.)

"Regular" 如果您想调整单个算法的超参数,交叉验证很好。但是,一旦您开始 运行 使用相同的交叉验证 parameters/folds 进行超参数优化,您的性能估计可能会过于乐观。如果您 运行 一遍又一遍地进行交叉验证,那么您的测试数据将在某种程度上变成 "training data" 。

Actually, people ask me this question quite often, so I'll quote some excerpts from a FAQ section I posted here: http://sebastianraschka.com/faq/docs/evaluate-a-model.html

In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model via k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. After we have identified our "favorite" algorithm, we can follow up with a "regular" k-fold cross-validation approach (on the complete training set) to find its "optimal" hyperparameters and evaluate it on the independent test set. Let's consider a logistic regression model to make this clearer: using nested cross-validation, you will train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then you proceed with the next algorithm, e.g., an SVM, etc.

I can only warmly recommend this excellent paper, which discusses this issue in more detail:

PS: Usually, you don't need/want to tune the hyperparameters of a random forest (so extensively). The idea behind random forests (a form of bagging) is essentially to not prune the decision trees; in fact, one reason Breiman came up with the random forest algorithm was to deal with the pruning issue/overfitting of individual decision trees. So the only parameter you really have to "worry" about is the number of trees (and maybe the number of random features per tree). However, you are typically best off training on bootstrap samples of size n (where n is the number of samples in the original training set) and sqrt(m) features (where m is the dimensionality of the training set).
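As a rough sketch of that advice (the values below are illustrative, not tuned settings):

from sklearn.ensemble import RandomForestClassifier

# unpruned trees (the default), sqrt(m) random features per split;
# n_estimators is the main knob worth touching
rf = RandomForestClassifier(n_estimators=500, max_features='sqrt')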

Hope this helps!

Edit:

Some example code for nested CV via scikit-learn:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
# (in older scikit-learn: sklearn.grid_search and sklearn.cross_validation)
import numpy as np

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]


# Nested Cross-validation (here: 5 x 2 cross validation)
# =====================================
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=2)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
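As described above, once you have identified your "favorite" algorithm this way, you could follow up with a "regular" grid search on the complete training set and a single evaluation on the independent test set. A minimal sketch, assuming held-out X_test/y_test arrays exist:

# regular (non-nested) tuning on the complete training set,
# then one evaluation on the independent test set
gs = gs.fit(X_train, y_train)
print('Best parameters: %s' % gs.best_params_)
print('Test accuracy: %.3f' % gs.score(X_test, y_test))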