What is the difference between these two ways of specifying training/testing data for sklearn GPR

Question

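This is a follow-up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPR, and I think I may be making a methodological mistake in how I use training and testing data.

Essentially, I would like to know the difference between specifying the training data by splitting the input into training and testing sets: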

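import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T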

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size = 0.33,
                                                    random_state = 0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
                                alpha=1e-10,
                                copy_X_train=True,
                                kernel = kernel,
                                n_restarts_optimizer=10,
                                normalize_y=False,
                                random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0,10,1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)

versus training on the full dataset:

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
                                alpha=1e-10,
                                copy_X_train=True,
                                kernel = kernel,
                                n_restarts_optimizer=10,
                                normalize_y=False,
                                random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0,10,1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)

Would either of these options lead to incorrect predictions?

Answer 1

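You separate the training data from the test data so that you can evaluate your model; otherwise you have no way of knowing whether you are overfitting the data. For example, put the data in Excel and plot it with a smoothed line. Technically, that spline function in Excel is a perfect model of the points it was given, but it is useless for predicting new values.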
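To make the Excel analogy concrete, here is a minimal sketch of a model that passes through every training point yet generalizes poorly. It uses a GPR as a stand-in for the smoothed line, with a fixed short-length-scale RBF kernel and no noise term. The synthetic sine data, the kernel settings, and all variable names are assumptions chosen for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

# Synthetic noisy data (an assumption standing in for some_data / other_data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# optimizer=None keeps the kernel fixed; with a tiny alpha and no WhiteKernel
# the model interpolates the training points, noise included.
interpolator = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.1), alpha=1e-10, optimizer=None)
interpolator.fit(X_train, y_train)

train_r2 = interpolator.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = interpolator.score(X_test, y_test)     # R^2 on held-out data
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The training score looks essentially perfect here, so only the held-out score reveals the problem.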

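In your example, you predict on a uniform space so that you can visualize what the model thinks the underlying function is. That tells you nothing about how well the model generalizes, though. Sometimes you can get a very high accuracy (> 95%) on the training data and a lower accuracy on the test data, which means the model is overfitting.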

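In addition to plotting a uniform prediction space to visualize the model, you should also predict the values in the test set, and then look at the accuracy metrics (for example, the R² that gp.score returns) for both the testing and the training data.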
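A rough sketch of that workflow, reusing the kernel from the question, might look like the following. The synthetic data, the reduced n_restarts_optimizer, and the extra RMSE metric are assumptions added for illustration; substitute your own arrays and whichever metrics you care about.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                              random_state=0)
gp.fit(X_train, y_train)

# Compare the same metric on data the model has and has not seen.
train_r2 = gp.score(X_train, y_train)
test_r2 = gp.score(X_test, y_test)

# Predict the held-out points directly and compute an error in the units of y.
y_test_pred, y_test_std = gp.predict(X_test, return_std=True)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print(f"train R^2 = {train_r2:.3f}")
print(f"test  R^2 = {test_r2:.3f}")
print(f"test RMSE = {test_rmse:.3f}")

# The uniform grid is still useful, but only for visualizing the fitted function.
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
```

A large gap between the training and test scores is the sign of overfitting; scoring on the same data you fitted on, as in the second snippet of the question, cannot show that gap.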

python

gaussian

scikit-learn