Scikit learn GaussianProcessClassifier memory error when using fit() function

I have X_train and y_train as two numpy.ndarrays of size (32561, 108) and (32561,) respectively.

Every time I call fit() on my GaussianProcessClassifier, I get a memory error.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X_train.shape
(32561, 108)
>>> y_train.shape
(32561,)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train,y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 613, in fit
    self.base_estimator_.fit(X, y)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 209, in fit
    self.kernel_.bounds)]
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 427, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 199, in fmin_l_bfgs_b
    **opts)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 201, in obj_func
    theta, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 338, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 753, in __call__
    K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1002, in __call__
    K = self.constant_value * np.ones((X.shape[0], Y.shape[0]))
  File "/home/retsim/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 188, in ones
    a = empty(shape, dtype, order)
MemoryError
>>> 

Why am I getting this error, and how can I fix it?

On line 400 of gpc.py, the implementation of the classifier you are using creates a matrix of size (N, N), where N is the number of observations. So the code is trying to create a matrix of shape (32561, 32561). That is obviously going to cause problems, since that matrix has over a billion elements.
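
For scale, a quick back-of-the-envelope check of that allocation (assuming NumPy's default float64 dtype, 8 bytes per entry):

n = 32561
kernel_bytes = n * n * 8              # one (n, n) float64 matrix
print(kernel_bytes / 1024.0 ** 3)     # ~ 7.9 GiB
# and per the traceback, eval_gradient=True additionally allocates an
# (n, n, n_hyperparameters) kernel-gradient array on top of this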

As for why it does this, I don't really know scikit-learn's implementation, but in general, Gaussian processes need to estimate a covariance matrix over the whole input, which is why they are not a great choice if you have high-dimensional data. (The documentation says "high-dimensional" means anything above a few dozen.)

My only suggestion for fixing it is to work in batches, as sketched below. Scikit-learn may have some utilities to generate batches for you, or you can do it manually.
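
Note that GaussianProcessClassifier has no partial_fit, so in practice "batching" here means fitting on a subsample that is small enough for its kernel matrix to fit in memory. A minimal sketch, with X_train and y_train as in the question and an arbitrary subsample size m:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

m = 2000                                   # (m, m) kernel ~ 32 MB instead of ~ 8 GB
rng = np.random.RandomState(0)
idx = rng.choice(X_train.shape[0], size=m, replace=False)

gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp_opt.fit(X_train[idx], y_train[idx])     # fits on the subsample only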

According to the Scikit-Learn documentation, the GaussianProcessClassifier estimator (and likewise GaussianProcessRegressor) has a parameter copy_X_train that is set to True by default:

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

The description of the copy_X_train parameter states:

If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.

I tried fitting the estimator on a PC with 32 GB of RAM, using a training dataset of a size (observations and features) similar to the one the OP mentioned. With copy_X_train set to True, the 'persistent copy of the training data' presumably ate up my RAM and caused the MemoryError. Setting this parameter to False resolved the issue.
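
Concretely, the fix amounts to one extra constructor argument:

gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                   copy_X_train=False)
gp_opt.fit(X_train, y_train)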

Scikit-Learn's description states that with this setting, 'just a reference to the training data is stored, which might cause predictions to change if the data is modified externally'. My reading of that sentence is:

Instead of storing a persistent copy of the whole training dataset (an n-by-d matrix of n observations and d features) in the fitted estimator, only a reference to this dataset is stored - hence avoiding the extra RAM usage. As long as the dataset stays intact externally (i.e. not within the fitted estimator), it can be reliably fetched when a prediction has to be made. Modifying the dataset, however, affects the predictions.
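
A small self-contained demonstration of those reference semantics (assuming fit() keeps the array by reference rather than making an internal conversion copy; the synthetic data here is purely illustrative):

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.RandomState(0)
X_small = rng.randn(50, 4)                 # tiny synthetic dataset
y_small = (X_small[:, 0] > 0).astype(int)

gp_ref = GaussianProcessClassifier(copy_X_train=False)
gp_ref.fit(X_small, y_small)

X_test = np.zeros((1, 4))
before = gp_ref.predict_proba(X_test)
X_small *= 2.0                             # modify the training data externally, in place
after = gp_ref.predict_proba(X_test)       # can differ from 'before': the fitted model
                                           # holds only a reference to X_small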

There may well be a better and more theoretical explanation of the above.