Why am I getting different scores on training data?

Question

I am trying to build an optimized SVM classification model with scikit-learn. I am fairly new to Python, and I am not really an ML person in general. This is the code I am using:

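# Training the SVM model on the Training set
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

classifier = SVC()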

# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
# define search space
space = dict()
space['kernel'] = ["linear", "rbf", "sigmoid", "poly"]
space['C'] = [0.1, 1, 10, 100, 1000]
space['gamma'] = [1, 0.1, 0.01, 0.001, 0.0001]
space['tol'] = [1e-3, 1e-4, 1e-5, 1e-6]

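# define search
# (the space above has 4 * 5 * 5 * 4 = 400 combinations, so n_iter=500
# cannot draw more than 400 distinct candidates)
search = RandomizedSearchCV(classifier, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)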

# execute search
result = search.fit(X_train, y_train)

# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
bestModel = result.best_estimator_

# Test: score the refit best model on the training data itself
a = X_train
b = y_train
grid_predictions = bestModel.predict(a)
print('Training accuracy: %s' % accuracy_score(b, grid_predictions))

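I am trying to see how well my training data is classified. My question is: why do I get different accuracy values from result.best_score_ (the accuracy of the best model found by the search) and accuracy_score(b, grid_predictions) (where I feed the exact same training data to the best-performing model)?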

Answer 1

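The difference arises because best_score_ is the mean cross-validated score (accuracy, in your case) of the best estimator, that is, of the "estimator which gave highest score (or smallest loss if specified) on the left out data". The left-out data comes from your cross-validation: inside RandomizedSearchCV, each candidate is fit on the training folds and scored on the one fold it has not seen, and best_score_ is the average of those held-out accuracies over all splits (10 folds x 5 repeats = 50 splits with your cv).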
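To make the meaning of best_score_ concrete, here is a minimal sketch that recomputes it by hand from cv_results_. The synthetic dataset from make_classification, the reduced search space, and the plain StratifiedKFold are assumptions chosen to keep the example fast; they stand in for the asker's X_train, y_train, and search settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption: not the asker's data).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
space = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 0.01]}
search = RandomizedSearchCV(SVC(), space, n_iter=5, scoring="accuracy",
                            cv=cv, random_state=1)
search.fit(X_train, y_train)

# Held-out accuracy of the winning candidate on each individual split.
i = search.best_index_
fold_scores = np.array([
    search.cv_results_[f"split{k}_test_score"][i]
    for k in range(cv.get_n_splits())
])

print("held-out accuracy per fold:", fold_scores)
print("mean over folds:           ", fold_scores.mean())
print("search.best_score_:        ", search.best_score_)
```

The last two printed values should agree, which shows that best_score_ is an average over held-out folds rather than a score on the full training set.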

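On the other hand, accuracy_score(b, grid_predictions) is computed with the same hyperparameters, but not on unseen data: instead of a held-out fold, it uses all of your training data (based on the code you provided). Because RandomizedSearchCV uses refit=True by default, best_estimator_ has already been refit on the whole of X_train, so this number measures how well the model reproduces the data it was trained on, and it is usually higher than best_score_.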

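In other words, both metrics are computed the same way and with the same hyperparameters, but the predictions are made on different data: held-out folds in one case, and the training set itself in the other.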
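The gap between the two numbers can be reproduced without any hyperparameter search. The sketch below is illustrative only: the synthetic data, the train/test split, and the hyperparameters C=100 and gamma=0.1 are assumptions picked so that the SVM fits the training set closely. It compares the cross-validated accuracy, the accuracy on the training data, and the accuracy on a separate test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic data (assumption: stands in for the asker's dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hypothetical hyperparameters, not the ones found by the asker's search.
model = SVC(kernel="rbf", C=100, gamma=0.1)

# 1) Cross-validated accuracy: every prediction is made on a held-out fold.
cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

# 2) Training accuracy: the model predicts the same rows it was fit on.
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))

# 3) Test accuracy: rows the model has never seen.
test_acc = accuracy_score(y_test, model.predict(X_test))

print("cross-validated accuracy:", cv_acc)
print("training accuracy:       ", train_acc)
print("test accuracy:           ", test_acc)
```

Of the three, the cross-validated and test accuracies are the ones that estimate performance on new data. The training accuracy is typically optimistic, which is why it should not be expected to match best_score_.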

python

machine-learning

svm

scikit-learn

grid-search