How to incorporate the validation set in machine learning?
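I'm trying to learn machine learning, but I can't figure out when and how to use the validation set. I understand that it is used to evaluate candidate models before checking against the test set, but I don't understand how to write that properly in code. Take this code I'm working on as an example: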
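import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor
# Split the set into train, validation, and test set (70:15:15 for train:valid:test)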
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7) # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set
print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))
import warnings # suppress warnings
warnings.filterwarnings('ignore')
# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()
# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train) # Standardization: each feature has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train) # Normalization: each feature is rescaled to lie between 0 and 1
X_train_Robust = rob.fit_transform(X_train) # Robust scaling: centers on the median and scales by the interquartile range (less sensitive to outliers)
# Transforming the TEST set
X_test_Standard = std.fit_transform(X_test)
X_test_MinMax = minmax.fit_transform(X_test)
X_test_Robust = rob.fit_transform(X_test)
# Test scalers for decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))
# Now we train our model for different values of `max_depth`, ranging from 1 to 29.
max_depths = range(1, 30)
training_error = []
for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X, y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))
testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))
plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()
Where do I run the evaluation on the validation set? How do I incorporate it into the code?
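First, you can create a single model and keep reusing it, setting max_depth on each pass through the loop. Note that in scikit-learn every call to fit retrains the tree from scratch, so creating a new model per step is not what prevents improvement. The real problem in your loops is that every model is fitted on the full data (X, y), which includes the rows you later test on.
Second: the idea behind the validation set is to evaluate your training progress and see how your model handles data it has not seen before. You therefore need to include it in your training process: compare the validation error of each candidate depth, pick the best one, and only then look at the test set.
In your case it would look something like this:
training_error, val_error = [], []
model = DecisionTreeRegressor(random_state=0) # here we create the model we want to use
for max_depth in max_depths:
    model.set_params(max_depth=max_depth) # here we set the depth to try next
    model.fit(X_train, y_train) # here we train the model (training data only)
    training_error.append(mean_squared_error(y_train, model.predict(X_train))) # here we calculate the training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid))) # here we calculate the validation error
best_depth = max_depths[val_error.index(min(val_error))] # depth with the lowest validation error
model.set_params(max_depth=best_depth).fit(X_train, y_train) # here we refit the chosen model
test_error = mean_squared_error(y_test, model.predict(X_test)) # here we calculate the test error, once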
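If you want to try this loop end to end, here is a minimal self-contained sketch. The synthetic data from make_regression is only a stand-in for your X and y (an assumption, since your dataset isn't shown), and best_depth is a name introduced here for illustration. It tunes max_depth on the validation set and touches the test set exactly once.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real X and y (assumption for illustration only).
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# 70:15:15 split into train, validation, and test, as in the question.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)

max_depths = range(1, 30)
training_error, val_error = [], []

model = DecisionTreeRegressor(random_state=0)
for max_depth in max_depths:
    model.set_params(max_depth=max_depth)  # candidate hyperparameter value
    model.fit(X_train, y_train)            # fit on the training set only
    training_error.append(mean_squared_error(y_train, model.predict(X_train)))
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))

# Choose the depth that minimizes the validation error.
best_depth = max_depths[val_error.index(min(val_error))]

# Refit with the chosen depth, then evaluate once on the held-out test set.
model.set_params(max_depth=best_depth).fit(X_train, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"best max_depth = {best_depth}, test MSE = {test_error:.2f}")
```

In your plot you would then draw val_error instead of testing_error and place the vertical line at best_depth rather than at a hard-coded value.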
Make sure you fit only on your training data, never on your validation or test data.
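The same rule applies to preprocessing. The code in the question calls fit_transform on the test set, which re-estimates the scaling statistics from test data. A sketch of the usual pattern follows, using StandardScaler and randomly generated features as a hypothetical stand-in for your data: fit the scaler on the training set, then only transform the validation and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix standing in for the real X (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_rem = train_test_split(X, train_size=0.7, random_state=0)
X_valid, X_test = train_test_split(X_rem, test_size=0.5, random_state=0)

std = StandardScaler()
X_train_std = std.fit_transform(X_train)  # learn mean and std from the training set only
X_valid_std = std.transform(X_valid)      # reuse the training statistics
X_test_std = std.transform(X_test)        # transform only, never fit on test data

print("train means after scaling:", X_train_std.mean(axis=0).round(6))
print("valid means after scaling:", X_valid_std.mean(axis=0).round(3))
```

The validation and test means will generally not be exactly zero, and that is expected: they are expressed in the training set's units, which is what the model saw during fitting.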