How to incorporate the validation set in machine learning?

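I'm trying to learn machine learning, but I can't work out when and how to use the validation set. I understand that it is used to evaluate candidate models before checking against the test set, but I don't see how to write that correctly in code. Take this code I'm working on as an example: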

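import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Split the set into train, validation, and test set (70:15:15 for train:valid:test)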
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7)          # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set

print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))

import warnings # suppress warnings
warnings.filterwarnings('ignore')

# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()

# Fit each scaler on the TRAINING set and transform it
X_train_Standard = std.fit_transform(X_train)   # Standardization: each feature gets mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train)  # Normalization: each feature is rescaled to the range [0, 1]
X_train_Robust = rob.fit_transform(X_train)     # Robust scaling: centers on the median and scales by the IQR, so outliers have less influence

# Transforming the TEST set with the scalers fitted on the training set (transform only, no refit)
X_test_Standard = std.transform(X_test)
X_test_MinMax = minmax.transform(X_test)
X_test_Robust = rob.transform(X_test)

# Test scalers for decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))

# Now we train our model for different values of `max_depth`, ranging from 1 to 29.

max_depths = range(1, 30)
training_error = []

for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X,y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))


testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))

plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()

Where do I run the evaluation on the validation set, and how do I incorporate it into the code?

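First, look at what your loops do with the model. Each iteration creates a new model and overwrites the previous one, and both loops fit on the full `X, y`, so the test rows are part of the training data. Creating one model per `max_depth` is fine in itself, because the depth is fixed when the tree is constructed and `fit` retrains from scratch anyway; reusing a single model only pays off when training is incremental. What is missing is a record of how each candidate did on data it was not trained on.

Second, the idea behind the validation set is to evaluate your training progress and see how your model handles data it has not seen before. So it belongs inside the tuning loop: fit on the training set, score on the validation set, and use the validation score to pick `max_depth`. The test set is scored once, at the very end.

In your case it would look like this:

training_error, val_error = [], []
for max_depth in max_depths:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0) # one candidate model per depth
    model.fit(X_train, y_train) # train on the training set only
    training_error.append(mean_squared_error(y_train, model.predict(X_train))) # training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid))) # validation error

best_depth = max_depths[val_error.index(min(val_error))] # depth with the lowest validation error
model = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test)) # test error, computed once at the end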

Make sure you train only on your training data, never on your validation or test data.
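
The same rule covers preprocessing: a scaler counts as "trained" too, so it should only ever be fitted on the training rows. Below is a rough, self-contained sketch of the whole flow, not a definitive implementation. The synthetic data from `make_regression` stands in for your `X, y`, and `make_model` is a hypothetical helper name I made up for this example. Wrapping the scaler and the tree in a pipeline means a single `fit(X_train, y_train)` fits both on the training set only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real X, y (an assumption for this sketch).
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=0)

# 70 / 15 / 15 split, as in the question.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)


def make_model(max_depth):
    """Hypothetical helper: scaler + tree, so fit() fits both on the same rows."""
    return make_pipeline(
        StandardScaler(),
        DecisionTreeRegressor(max_depth=max_depth, random_state=0),
    )


max_depths = list(range(1, 30))
val_error = []
for max_depth in max_depths:
    # Fit on the training set only, score on the validation set.
    model = make_model(max_depth).fit(X_train, y_train)
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))

# Choose the depth with the lowest validation error.
best_depth = max_depths[int(np.argmin(val_error))]

# Refit the chosen configuration on the training set and touch the test set once.
final_model = make_model(best_depth).fit(X_train, y_train)
test_error = mean_squared_error(y_test, final_model.predict(X_test))

print(f"best max_depth = {best_depth}")
print(f"validation MSE = {min(val_error):.1f}, test MSE = {test_error:.1f}")
```

Because the test set plays no part in choosing `best_depth`, `test_error` remains an honest estimate of how the selected model does on unseen data. A common variant, once the depth is chosen, is to refit on the training and validation rows together before the final test evaluation; I have left that out to keep the sketch short.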