Scipy 高斯 KDE：矩阵不是正定的

Question

我正在尝试估计数据集在某些点的密度，使用 scipy。

from scipy.stats import gaussian_kde
import numpy as np

我有一个 3D 点的数据集 A（这只是一个最小的例子。我的实际数据有更多的维度和更多的样本）

A = np.array([[0.078377  , 0.76737392, 0.45038174],
       [0.65990129, 0.13154658, 0.30770917],
       [0.46068406, 0.22751313, 0.28122463]])

以及我想估计密度的点

B = np.array([[0.40209377, 0.21063273, 0.75885516],
       [0.91709997, 0.79303252, 0.65156937]])

但是我好像无法使用gaussian_kde功能，因为

result = gaussian_kde(A.T)(B.T)

returns

LinAlgError: Matrix is not positive definite

如何修复此错误？我如何获得样品的密度？

Answer 1

TL;博士：

您的数据中具有高度相关的特征，这会导致数值错误。有几种可能的方法来解决这个问题，每种方法都有利有弊。下面提议用 class 替代 gaussian_kde。

诊断

您的 dataset（即您在创建 gaussian_kde 对象时输入的矩阵，在使用它时 而不是 ）可能包含高度依赖的特征。这一事实（可能与 float32 等低数值分辨率和“太多”数据点相结合）导致协方差矩阵具有接近零的特征值，由于数值误差可能会低于零。这反过来又会破坏在数据集协方差矩阵上使用 Cholesky 分解的代码（具体细节见解释）。

假设您的数据集具有 (dims, N) 的形状，您可以通过 np.linalg.eigh(np.cov(dataset))[0] <= 0 测试这是否是您的问题。如果有任何产出True，让我第一个欢迎你加入俱乐部。

治疗

想法是让所有特征值都大于零。

将数值分辨率提高到实用的最高浮点数可能是一个简单的解决方法，值得一试，但可能还不够。
鉴于这是由相关特征引起的事实，删除数据点并没有多大帮助先验 .有一个渺茫的希望，即要粉碎的数字越少，传播的错误就越少，并使特征值保持在零以上。它很容易实现，但它会丢弃数据点。
一个更复杂的修复是识别高度相关的特征并合并它们或忽略“多余”的特征。这很棘手，尤其是当维度之间的相关性因实例而异时。
可能最干净的方法是保持数据不变，并将有问题的特征值提升为正值。这通常通过两种方式完成：
SVD 直接解决了问题的核心：分解协方差矩阵并将负特征值替换为小的正特征值 epsilon。这将使您的矩阵回到 PD 域，引入最小的错误。
如果 SVD 的计算成本太高，另一种数值破解方法是将 epsilon * np.eye(D) 添加到协方差矩阵。这具有将 epsilon 添加到每个特征值的效果。同样，如果 epsilon 足够小，则不会引入那么大的错误。

如果你选择最后一种方法，你需要告诉 gaussian_kde 修改它的协方差矩阵。这是我发现的一种相对干净的方法：只需将此 class 添加到您的代码库并将 gaussian_kde 替换为 GaussianKde （在我这边测试过，似乎工作正常）。

    class GaussianKde(gaussian_kde):
        """
        Drop-in replacement for gaussian_kde that adds the class attribute EPSILON
        to the covmat eigenvalues, to prevent exceptions due to numerical error.
        """
    
        EPSILON = 1e-10  # adjust this at will
    
        def _compute_covariance(self):
            """Computes the covariance matrix for each Gaussian kernel using
            covariance_factor().
            """
            self.factor = self.covariance_factor()
            # Cache covariance and inverse covariance of the data
            if not hasattr(self, '_data_inv_cov'):
                self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1,
                                                             bias=False,
                                                             aweights=self.weights))
                # we're going the easy way here
                self._data_covariance += self.EPSILON * np.eye(
                    len(self._data_covariance))
                self._data_inv_cov = np.linalg.inv(self._data_covariance)
    
            self.covariance = self._data_covariance * self.factor**2
            self.inv_cov = self._data_inv_cov / self.factor**2
            L = np.linalg.cholesky(self.covariance * 2 * np.pi)
            self._norm_factor = 2*np.log(np.diag(L)).sum()  # needed for scipy 1.5.2
            self.log_det = 2*np.log(np.diag(L)).sum()  # changed var name on 1.6.2

说明

如果您的错误与此类似，但又不尽然，或者有人感到好奇，这是我遵循的过程，希望对您有所帮助。

异常堆栈指定错误是在 Cholesky 分解期间触发的。就我而言，这是 _compute_covariance 方法中的 this line。
确实，在异常发生后，通过 np.eigh 检查 covariance 和 inv_cov 属性的特征值显示 covariance 有一个接近-为零的负特征值，因此它的逆有一个巨大的。由于 Cholesky 需要 PD 矩阵，因此这引发了错误。
在这一点上，我们可以严重怀疑微小的负特征值是数值错误，因为协方差矩阵是 PSD。事实上，当协方差矩阵最初是根据传递给构造函数的数据集计算时，错误源就来了，here。在我的例子中，协方差矩阵产生了至少 2 个几乎相同的列，这导致了由于数值误差导致的剩余负特征值。
你的数据集什么时候会产生准奇异协方差矩阵？这个问题在另一个 SE post. The bottom line is: If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix) (Kudos and +1 to ttnphns 中得到了完美的解决，这是一项了不起的工作。

Scipy 高斯 KDE：矩阵不是正定的

Scipy Gaussian KDE : Matrix is not positive definite

python

scipy

kernel-density

TL;博士：

诊断

治疗

说明