Multiple-output Gaussian Process regression in scikit-learn

I am using scikit-learn to perform Gaussian Process Regression (GPR) in order to predict data. My training data are as follows:

x_train = np.array([[0,0],[2,2],[3,3]]) # 2-D Cartesian coordinate points

y_train = np.array([[200,250,155],[321,345,210],[417,445,851]]) # observed outputs from three different data sources at the respective input points (x_train)

The test points (2-D) at which the mean and the variance/standard deviation need to be predicted are:

xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])

x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()]) 
x_test = (np.array(positions)).T

Now, after running the GPR (GaussianProcessRegressor) fit (here, the product of ConstantKernel and RBF is used as the kernel in GaussianProcessRegressor), the mean and the variance/standard deviation can be predicted with the following line of code:

y_pred_test, sigma = gp.predict(x_test, return_std=True)

On printing the predicted mean (y_pred_test) and variance (sigma), I get the following output in the console:

In the predicted values (mean), a 'nested array' is printed, with three objects inside each inner array. It can be assumed that the inner arrays hold the predicted mean value of each data source at each 2-D test point. However, the printed variance contains only a single array with 16 objects (presumably one for each of the 16 test location points). I know that the variance indicates the uncertainty of the estimation, so I was expecting a predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test point? Is it due to a mistake in my code?
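For reference, here is a minimal, self-contained sketch of the setup described above (the kernel hyperparameter values are illustrative assumptions, since they are not given in the question):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Training data from the question
x_train = np.array([[0, 0], [2, 2], [3, 3]])
y_train = np.array([[200, 250, 155], [321, 345, 210], [417, 445, 851]])

# Test grid from the question: 16 2-D locations
xvalues = np.array([0, 1, 2, 3])
yvalues = np.array([0, 1, 2, 3])
x, y = np.meshgrid(xvalues, yvalues)
x_test = np.vstack([x.ravel(), y.ravel()]).T

# Product of ConstantKernel and RBF, as described in the question;
# the initial hyperparameter values are placeholders refined by the optimizer
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(x_train, y_train)

y_pred_test, sigma = gp.predict(x_test, return_std=True)
print(y_pred_test.shape)  # (16, 3): one mean per data source per test point
print(sigma.shape)        # (16,) in the scikit-learn version discussed here;
                          # recent versions may return (16, 3) instead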

First of all, if the parameter used is "sigma", it refers to the standard deviation, not the variance (recall that the variance is just the standard deviation squared).

It is easier to conceptualize using the variance, since for a set of points the variance measures the average squared deviation of the points from the mean of the set.

In your case, you have a set of 2-D points. If you think of these as points on a 2-D plane, the variance quantifies the spread of each point around the mean, and the standard deviation is the positive square root of the variance.

In this situation, you have 16 test points and 16 standard deviation values. This makes perfect sense, since each test point has its own defined distance from the mean of the set.

If you want to compute the variance of a set of points, you can do so by summing the squares of the individual points, dividing by the number of points, and subtracting the square of the mean. The positive square root of this number yields the standard deviation of the set.
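As a quick illustration of that computation (a plain numpy sketch with made-up numbers, unrelated to the GPR API):

import numpy as np

points = np.array([200.0, 250.0, 155.0])  # made-up scalar samples

# variance = mean of the squares minus the square of the mean
variance = np.mean(points ** 2) - np.mean(points) ** 2
std_dev = np.sqrt(variance)  # positive square root of the variance

assert np.isclose(variance, np.var(points))  # matches numpy's population variance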

ASIDE: This also means that if you change the set through insertion, deletion, or substitution, the standard deviation of every point will change, because the mean will be recalculated to accommodate the new data. This iterative process is the fundamental force behind k-means clustering.

Well, you have inadvertently hit on an iceberg indeed...

As a prelude, let's make clear that the concepts of variance and standard deviation are defined only for scalar variables; for vector variables (such as your own 3-d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
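To make this concrete, here is a small numpy sketch (reusing the question's y_train values as toy samples): for a 3-d vector variable, the natural second-moment object is a 3x3 covariance matrix, not a single variance number:

import numpy as np

# toy samples of a 3-d vector variable: rows = observations, columns = the 3 outputs
samples = np.array([[200, 250, 155],
                    [321, 345, 210],
                    [417, 445, 851]])

cov = np.cov(samples, rowvar=False)  # rowvar=False: the columns are the variables
print(cov.shape)  # (3, 3): variances on the diagonal, cross-covariances off it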

Continuing the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):

Returns:

y_mean : array, shape = (n_samples, [n_output_dims])
    Mean of predictive distribution at query points

y_std : array, shape = (n_samples,), optional
    Standard deviation of predictive distribution at query points. Only returned when return_std is True.

y_cov : array, shape = (n_samples, n_samples), optional
    Covariance of joint predictive distribution at query points. Only returned when return_cov is True.

Combining this with my earlier remark about the covariance matrix, your first impulse might be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will give a 16x16 matrix, rather than a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
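A quick shape check of both options (continuing from the fitted gp in the sketch above; the exact shapes may vary across scikit-learn versions):

y_mean, y_std = gp.predict(x_test, return_std=True)
_, y_cov = gp.predict(x_test, return_cov=True)

print(y_std.shape)  # (16,): one value per test point, as reported in the question
print(y_cov.shape)  # (16, 16): covariance between test points, not between the 3 outputs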

Having clarified these details, let's proceed to the heart of the issue.


At the heart of your problem lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.

Let's look for some corroboration of this claim in the recent scientific literature:

Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):

most GPR implementations model only a single response variable, due to the difficulty in the formulation of covariance function for correlated multiple response variables, which describes not only the correlation between data points, but also the correlation between responses. In the paper we propose a direct formulation of the covariance function for multi-response GPR, based on the idea that [...]

Despite the high uptake of GPR for various modelling tasks, there still exists some outstanding issues with the GPR method. Of particular interest in this paper is the need to model multiple response variables. Traditionally, one response variable is treated as a Gaussian process, and multiple responses are modelled independently without considering their correlation. This pragmatic and straightforward approach was taken in many applications (e.g. [7, 26, 27]), though it is not ideal. A key to modelling multi-response Gaussian processes is the formulation of covariance function that describes not only the correlation between data points, but also the correlation between responses.

Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):

Typical GPs are usually designed for single-output scenarios wherein the output is a scalar. However, the multi-output problems have arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t)}, 1 ≤ t ≤ T, one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.

The study of MOGP has a long history and is known as multivariate Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.

Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:

Gaussian process analysis of processes with multiple outputs is limited by the fact that far fewer good classes of covariance functions exist compared with the scalar (single-output) case. [...]

The difficulty of finding “good” covariance models for multiple outputs can have important practical consequences. An incorrect structure of the covariance matrix can significantly reduce the efficiency of the uncertainty quantification process, as well as the forecast efficiency in kriging inferences [16]. Therefore, we argue, the covariance model may play an even more profound role in co-kriging [7, 17]. This argument applies when the covariance structure is inferred from data, as is typically the case.


Therefore, and as I said, my understanding is that scikit-learn cannot really handle the case, despite the fact that nothing of the sort is mentioned or hinted at in the documentation (it might be interesting to open a relevant issue at the project page). This seems to be the conclusion reached in other discussions, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.

That said, apart from reverting to the option of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you will probably be discarding useful information about the correlations between your 3-D output elements), there is at least one Python toolbox that seems capable of modeling multi-output GPs, namely runlmc (paper, code, documentation).
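For completeness, here is a minimal sketch of the "model each output separately" fallback (reusing x_train, y_train, and x_test from the question); each single-output GaussianProcessRegressor then yields its own per-point standard deviation, at the cost of ignoring the correlations between the outputs:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# one independent single-output GP per column (data source) of y_train
means, stds = [], []
for i in range(y_train.shape[1]):
    gp_i = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(1.0))
    gp_i.fit(x_train, y_train[:, i])
    mu_i, sigma_i = gp_i.predict(x_test, return_std=True)
    means.append(mu_i)
    stds.append(sigma_i)

y_mean = np.column_stack(means)  # shape (16, 3)
y_std = np.column_stack(stds)    # shape (16, 3): a std per data source per test point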