Using scipy gaussian kernel density estimation to calculate CDF inverse

Question

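The gaussian_kde class in scipy.stats has a method evaluate that returns the value of the PDF at an input point. I'm trying to use gaussian_kde to estimate the inverse CDF. The motivation is to generate Monte Carlo realizations of some input data whose statistical distribution is numerically estimated using KDE. Is there a method bound to gaussian_kde that serves this purpose?

The example below shows how this should work for the case of a Gaussian distribution. First, I show how to do the PDF calculation, to set up the specific API I'm trying to achieve: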

import numpy as np 
from scipy.stats import norm, gaussian_kde

npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)

npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)

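Is there an analogously simple way to compute the inverse CDF? The norm distribution has a very handy isf method (the inverse survival function) that does this: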

cdf_value = np.sort(np.random.rand(npts_sample))
cdf_inv = norm.isf(1 - cdf_value)

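Does such a function exist for gaussian_kde? Or is it straightforward to construct such a function from the methods that are already implemented?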
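For reference, here is a small sketch of the target behaviour on the exact normal distribution. It relies only on the standard scipy.stats identities that isf is the inverse of the survival function 1 - CDF and that ppf is the inverse CDF itself; the probabilities in q are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary probabilities chosen for illustration.
q = np.array([0.025, 0.5, 0.975])

via_isf = norm.isf(1 - q)  # inverse survival function evaluated at 1 - q
via_ppf = norm.ppf(q)      # percent point function, i.e. the inverse CDF

print(np.allclose(via_isf, via_ppf))      # both routes give the same quantiles
print(np.allclose(norm.cdf(via_ppf), q))  # round trip back to the probabilities
```

So the question amounts to asking for a ppf-like function for a fitted gaussian_kde.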

Answer 1

The method integrate_box_1d can be used to compute the CDF, but it is not vectorized; you'll need to loop over points. If memory is not an issue, rewriting its source code (which is essentially just a call to special.ndtr) in vector form may speed things up.

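import matplotlib.pyplot as plt
from scipy.special import ndtr
# standard deviation of each Gaussian kernel (the KDE bandwidth)
stdev = np.sqrt(kde.covariance)[0, 0]
# each row holds the kernel CDFs at one x; the row mean is the KDE CDF
pde_cdf = ndtr(np.subtract.outer(x, n) / stdev).mean(axis=1)
plt.plot(x, pde_cdf)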
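As a sanity check, the looped and vectorized routes can be compared on a small problem. This is a sketch under assumptions: the sample size, seed, and grid are arbitrary, and the KDE is one-dimensional and unweighted.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=500)
kde = gaussian_kde(data)
grid = np.linspace(-3, 3, 50)

# Looped version: one integrate_box_1d call per grid point.
cdf_loop = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in grid])

# Vectorized version: builds a len(grid) x len(data) matrix in memory.
stdev = np.sqrt(kde.covariance[0, 0])
cdf_vec = ndtr(np.subtract.outer(grid, data) / stdev).mean(axis=1)

print(np.allclose(cdf_loop, cdf_vec))
```

The two agree because, for an unweighted one-dimensional KDE, the CDF is the average of the per-sample Gaussian CDFs scaled by the kernel standard deviation.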

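The plot of the inverse function would be plot(pde_cdf, x). If the goal is to compute the inverse function at specific points, consider interpolating the computed values of the CDF.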
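A minimal sketch of that interpolation idea follows. It uses plain linear interpolation via np.interp, which is a choice made here for simplicity rather than something the answer prescribes; the helper name kde_cdf, the grid extent, and the seed are also assumptions.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
data = rng.normal(size=2000)
kde = gaussian_kde(data)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

def kde_cdf(points):
    """CDF of the 1-D KDE: the average of the per-sample Gaussian CDFs."""
    points = np.atleast_1d(points)
    return ndtr(np.subtract.outer(points, data) / bw).mean(axis=1)

# Tabulate the CDF on a grid that extends past the data into both tails.
grid = np.linspace(data.min() - 3 * bw, data.max() + 3 * bw, 1000)
cdf_grid = kde_cdf(grid)

# Swap the roles of x and y: interpolate the grid as a function of the CDF.
q = np.array([0.05, 0.5, 0.95])
x_q = np.interp(q, cdf_grid, grid)

print(x_q)          # KDE quantile estimates
print(norm.ppf(q))  # exact standard normal quantiles, for comparison
```

This works because the tabulated CDF is increasing, so it can serve as the abscissa of the interpolation. Probabilities outside the tabulated range are clamped to the grid ends by np.interp.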

Answer 2

You can use some Python tricks for a fast and memory-efficient estimate of the CDF (based on this answer):

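    from scipy.special import ndtr
    # kernel standard deviation; kde.factor alone is only right for unit-variance data
    bw = np.sqrt(kde.covariance[0, 0])
    cdf = tuple(ndtr(np.ravel(item - kde.dataset) / bw).mean()
                for item in x)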

It runs about as fast as the fully vectorized approach in Answer 1, but has linear (len(kde.dataset)) space complexity instead of quadratic (strictly, len(kde.dataset) * len(x)).
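A middle ground between the per-point loop and the full outer product is to process the evaluation points in chunks, which caps the temporary array at chunk_size * len(kde.dataset) elements. The helper below is a hypothetical sketch, not part of scipy; it assumes a one-dimensional, unweighted KDE, and the chunk size is arbitrary.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde

def kde_cdf_chunked(kde, points, chunk_size=256):
    """Evaluate the CDF of a 1-D gaussian_kde in chunks (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    data = np.ravel(kde.dataset)
    bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation
    out = np.empty(points.shape[0])
    for start in range(0, points.shape[0], chunk_size):
        block = points[start:start + chunk_size]
        # Temporary array of shape (len(block), len(data)).
        out[start:start + chunk_size] = ndtr(
            np.subtract.outer(block, data) / bw).mean(axis=1)
    return out

rng = np.random.default_rng(2)
kde = gaussian_kde(rng.normal(size=1000))
x = np.linspace(-3, 3, 600)
cdf_chunked = kde_cdf_chunked(kde, x, chunk_size=128)
print(cdf_chunked[0], cdf_chunked[-1])
```

With chunk_size=1 this reduces to the per-point loop, and with chunk_size=len(x) it reduces to the full outer product.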

All you have to do next is apply an inverse approximation, for instance the one from statsmodels.
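The answer does not say which statsmodels helper it has in mind, so the sketch below swaps in a different technique: numerical root finding with scipy.optimize.brentq on the KDE's CDF. The helper name kde_ppf and the bracket width of 10 bandwidths are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(size=1000)
kde = gaussian_kde(data)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

def kde_ppf(q):
    """Inverse CDF of the KDE at one probability q in (0, 1) (hypothetical helper)."""
    # Bracket wide enough that the CDF is below q at lo and above q at hi.
    lo = data.min() - 10 * bw
    hi = data.max() + 10 * bw
    return brentq(lambda t: kde.integrate_box_1d(-np.inf, t) - q, lo, hi)

median = kde_ppf(0.5)
print(median, kde.integrate_box_1d(-np.inf, median))
```

Root finding costs one full CDF evaluation per iteration, so it suits a handful of quantiles. For many quantiles, tabulating the CDF once and interpolating is likely cheaper.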

Answer 3

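This question has already been answered in the other answers, but it took me a while to wrap my head around everything. Here is a complete example of the final solution: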

import numpy as np 
from scipy import interpolate
from scipy.special import ndtr
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde

# create kde
npts_kde = int(5e3)
n = np.random.normal(loc=0, scale=1, size=npts_kde)
kde = gaussian_kde(n)

# grid for plotting
npts_sample = int(1e3)
x = np.linspace(-3, 3, npts_sample)

# evaluate pdfs
kde_pdf = kde.evaluate(x)
norm_pdf = norm.pdf(x)

# cdf and inv cdf are available directly from scipy
norm_cdf = norm.cdf(x)
# ppf takes probabilities in (0, 1), not points on the x grid
p = np.linspace(0.01, 0.99, 99)
norm_inv = norm.ppf(p)

# estimate cdf; the kernel standard deviation is sqrt(kde.covariance),
# which equals kde.factor only for data with unit variance
bw = np.sqrt(kde.covariance[0, 0])
cdf = tuple(ndtr(np.ravel(item - kde.dataset) / bw).mean()
            for item in x)

# estimate inv cdf by interpolating x as a function of the cdf values
inversefunction = interpolate.interp1d(cdf, x, kind='cubic', bounds_error=False)

fig, ax = plt.subplots(1, 3, figsize=(6, 3))
ax[0].plot(x, norm_pdf, c='k')
ax[0].plot(x, kde_pdf, c='r', ls='--')
ax[0].set_title('PDF')
ax[1].plot(x, norm_cdf, c='k')
ax[1].plot(x, cdf, c='r', ls='--')
ax[1].set_title('CDF')
ax[2].plot(p, norm_inv, c='k')
ax[2].plot(p, inversefunction(p), c='r', ls='--')
ax[2].set_title("Inverse CDF")
plt.show()
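To connect this back to the Monte Carlo motivation in the question, the sketch below draws samples by inverse-transform sampling from a tabulated KDE CDF and, for comparison, with gaussian_kde.resample, which draws from the fitted KDE directly. The grid extent, sample counts, and seed are arbitrary assumptions, and linear interpolation via np.interp is used here instead of the cubic interp1d above.

```python
import numpy as np
from scipy.special import ndtr
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
data = rng.normal(loc=0, scale=1, size=2000)
kde = gaussian_kde(data)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

# Tabulate the KDE CDF once on a grid that covers both tails.
grid = np.linspace(data.min() - 5 * bw, data.max() + 5 * bw, 2000)
cdf_grid = ndtr(np.subtract.outer(grid, data) / bw).mean(axis=1)

# Inverse-transform sampling: push uniform draws through the inverse CDF.
u = rng.uniform(size=10_000)
samples_inv = np.interp(u, cdf_grid, grid)

# Built-in alternative: resample returns an array of shape (d, size).
samples_builtin = kde.resample(size=10_000)[0]

print(samples_inv.mean(), samples_inv.std())
print(samples_builtin.mean(), samples_builtin.std())
```

If random draws are all that is needed, resample avoids building the inverse CDF at all. The explicit inverse is still useful when specific quantiles are required, or when the uniform inputs come from elsewhere, such as a quasi-random sequence.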

Tags: python, numpy, scientific-computing, scipy