Why does the cross-validation result show high accuracy while there is overfitting?

Question

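I am using a random forest to solve a binary classification problem. The training set contains 70k samples of class "0" and only 3k samples of class "1". In addition, the predictions for X_test should contain equal numbers of "0" and "1".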

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_estimators=350, min_samples_split=6, min_samples_leaf=2)
scores = cross_validation.cross_val_score(clf, x_train, y_train, cv=cv)
print("Accuracy (random forest): {}+/-{}".format(scores.mean(), scores.std()))

Accuracy (random forest): 0.960755941369+/-1.40500919606e-06

clf.fit(x_train, y_train)
prediction_final = clf.predict(X_test)  # this returns target values: 76k zeroes and only 15 ones


from sklearn.metrics import precision_score, recall_score

# x_test is 10% of the x_train set
preds_test = clf.predict(x_test)
print("precision_score {}".format(precision_score(y_test, preds_test)))
print("recall_score {}".format(recall_score(y_test, preds_test)))

precision_score 0.0; recall_score 0.0

confusion_matrix [[7279 1] [ 322 0]]

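As far as I understand, there is an overfitting problem, but why does cross-validation not detect it? Even the standard deviation is very low. How can I solve this?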

P.S. I tried taking 3k rows with "0" and 3k rows with "1" as the training set. The model is much better, but this is not a solution.

Answer 1

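(Overall) accuracy is an almost useless measure for unbalanced data sets like yours, because it computes the percentage of correct predictions. In your case, imagine a classifier that learns nothing and always predicts "0". Since you have 70k zeros and only 3k ones, that classifier would reach an accuracy of 70/73 = 95.9%.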
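The arithmetic can be checked with a few lines of plain Python. This is only a sketch using the class counts quoted in the question; the variable names are made up for illustration.

```python
# Class counts taken from the question: 70k zeros, 3k ones.
n_zeros = 70000
n_ones = 3000

# A "classifier" that ignores its input and always predicts 0
# gets every zero right and every one wrong.
correct = n_zeros
total = n_zeros + n_ones
baseline_accuracy = correct / float(total)

print("always-0 accuracy: {:.1%}".format(baseline_accuracy))  # 95.9%
```

Any accuracy close to this value says very little about whether the model has learned to find the "1" class.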

Inspecting the confusion matrix is often helpful for exposing such a "classifier".
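As a rough illustration, the numbers below come from the confusion matrix posted in the question. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes); if the matrix was produced some other way, the reading changes accordingly.

```python
# Confusion matrix from the question.
# Assumed layout: rows = true class, columns = predicted class.
cm = [[7279, 1],
      [322, 0]]

tn, fp = cm[0]  # true zeros: predicted 0, predicted 1
fn, tp = cm[1]  # true ones:  predicted 0, predicted 1

accuracy = (tn + tp) / float(tn + fp + fn + tp)
recall_0 = tn / float(tn + fp)  # share of true zeros predicted as 0
recall_1 = tp / float(tp + fn)  # share of true ones predicted as 1

print("accuracy: {:.4f}".format(accuracy))
print("recall for class 0: {:.4f}".format(recall_0))
print("recall for class 1: {:.4f}".format(recall_1))
```

Under that assumption, overall accuracy stays high while not a single one of the 322 true ones is recovered, which is the pattern of a model that has collapsed to the majority class.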

Therefore, you should definitely use another measure to quantify classification quality. Average accuracy is one option, since it computes the mean of the per-class accuracies. In the binary case it is also called balanced accuracy and is computed as (TP/P + TN/N)/2, so the classifier imagined above, which always predicts "0", would score only (0% + 100%) / 2 = 50%. However, at the time of writing that measure does not seem to be implemented in scikit-learn. You could implement such a scoring function yourself, but it will probably be easier and faster to use one of the other predefined scorers.
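For reference, a hand-rolled version of the formula might look like the sketch below. The function name `balanced_accuracy` is made up for this example, it handles only binary 0/1 labels, and it assumes both classes occur in `y_true`. Newer scikit-learn releases (0.20 and later) ship `sklearn.metrics.balanced_accuracy_score`, which is preferable when available.

```python
def balanced_accuracy(y_true, y_pred):
    """Binary balanced accuracy: (TP/P + TN/N) / 2 for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    # Assumes n_pos > 0 and n_neg > 0, i.e. both classes are present.
    return (tp / float(n_pos) + tn / float(n_neg)) / 2.0


# Class counts from the question, scored against an always-0 prediction.
y_true = [0] * 70000 + [1] * 3000
y_pred = [0] * len(y_true)

score = balanced_accuracy(y_true, y_pred)
print(score)  # 0.5
```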

For example, you can compute the F1 score instead of accuracy by passing scoring='f1' to cross_validation.cross_val_score. The F1 score takes both precision and recall into account.
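A minimal sketch of the difference between the two scorers follows. It makes several assumptions that are not in the question: the data is synthetic (the same 70:3 class ratio with one uninformative feature), `DummyClassifier` stands in for a model that has collapsed to the majority class, and `cross_val_score` is imported from `sklearn.model_selection`, which is where newer scikit-learn versions keep it instead of `cross_validation`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the question's data: 700 zeros, 30 ones,
# and a single feature that carries no information.
y = np.array([0] * 700 + [1] * 30)
X = np.zeros((len(y), 1))

# Stand-in for a model that always predicts the majority class.
clf = DummyClassifier(strategy="most_frequent")

# Same estimator, same folds, two different scorers.
# The F1 scorer may emit a warning because no positives are predicted.
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")

print("accuracy: {:.3f} +/- {:.3f}".format(acc.mean(), acc.std()))
print("f1:       {:.3f} +/- {:.3f}".format(f1.mean(), f1.std()))
```

With a real model the F1 value will usually not be exactly zero, but the same comparison shows whether a high accuracy comes from actually detecting the minority class.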

python

machine-learning

python-2.7

random-forest

scikit-learn