Why is my average cross_val_score so different from my R-squared values on my training and test sets?
I'm running a regression problem and evaluating its performance. I'm wondering why my R-squared values are so different from my cross-validation scores. Is this a sign of overfitting? Here is an example of my setup; X and Y are predefined as the features and the target, respectively.
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

# mse and rmse are assumed to be thin wrappers around sklearn's mean_squared_error
mse = mean_squared_error
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2, random_state=47)
knn = neighbors.KNeighborsRegressor(n_neighbors=40, weights='distance')
knn.fit(X_train, y_train)
y_preds_train = knn.predict(X_train)
y_preds_test = knn.predict(X_test)
print('R square of training set:', knn.score(X_train, y_train))
print('_____________________Test Stats_____________________')
print('R square of test in the model:', knn.score(X_test, y_test))
print('MAE:', mean_absolute_error(y_test, y_preds_test))
print('MSE:', mse(y_test, y_preds_test))
print('RMSE:', rmse(y_test, y_preds_test))
print('MAPE:', np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100)

# cross_val_score uses the estimator's default scorer, which is R^2 for a regressor
score = cross_val_score(knn, X, Y, cv=5)
print("Cross Val Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std()*2))
score
Results:
R square of training set: 0.9881595480397585
_____________________Test Stats_____________________
R square of test in the model: 0.8611300681864155
MAE: 7.488081625869961
MSE: 164.64697808634588
RMSE: 12.831483861438079
MAPE: 368.35904890846416
Cross Val Accuracy: 0.65 (+/- 0.21)
array([0.58122339, 0.53346581, 0.8312428 , 0.69213113, 0.61482638])
While it is hard to infer much from these results alone, there are a few things you may want to check (a rough sketch covering them follows this list):
- Vary n_neighbors over a range of values and see how the cross-validation score changes, or even run GridSearchCV.
- Change cv in cross_val_score to 10. You can already see that one fold reaches a score of 0.83 while the rest stay below 0.7, which is striking. While you are at it, also change test_size to 0.1.
- Make sure the data in X is normalized.
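A minimal sketch of those checks, assuming X and Y are the feature matrix and target from the question: it scales the features with StandardScaler inside a Pipeline (so the scaler is fit only on the training folds of each split), grid-searches n_neighbors, and re-runs cross_val_score with cv=10. The search range for n_neighbors is an arbitrary choice for illustration, not something taken from the answer.
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so each CV split normalizes on its own training folds
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', neighbors.KNeighborsRegressor(weights='distance')),
])

# Hypothetical range of n_neighbors to search; adjust it to your data size
param_grid = {'knn__n_neighbors': list(range(5, 61, 5))}
grid = GridSearchCV(pipe, param_grid, cv=10)
grid.fit(X, Y)
print('Best n_neighbors:', grid.best_params_['knn__n_neighbors'])
print('Best mean CV R^2:', grid.best_score_)

# Look at the fold-to-fold spread with cv=10 (default scoring is R^2 for a regressor)
scores = cross_val_score(grid.best_estimator_, X, Y, cv=10)
print('Cross Val R^2: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))
print(scores)
To also try the smaller hold-out set from the second point, change test_size=.2 to test_size=.1 in the train_test_split call from the question.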