sklearn: KDE not working for small values

Question

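I am struggling to get scikit-learn's KDE implementation to work for a small input range. The following code works. Increase the divisor variable to 100, however, and the KDE struggles: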

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.neighbors import KernelDensity

# make data:
np.random.seed(0)
divisor = 1
gaussian1 = (3 * np.random.randn(1700))/divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])

# illustrate proper KDE with seaborn:
sns.distplot(gaussian_mixture);

# now implement in sklearn:

x_grid = np.linspace(min(gaussian1), max(gaussian2), 200)

kde_skl = KernelDensity(bandwidth=0.5)
kde_skl.fit(gaussian_mixture[:, np.newaxis])
# score_samples() returns the log-likelihood of the samples
log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)

fig, ax = plt.subplots(1, 1, sharey=True, figsize=(7, 4))
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5)

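This works fine. However, after changing the 'divisor' variable to 100, scipy and seaborn still handle the smaller data values, while sklearn's KDE, as I have implemented it, does not.

What am I doing wrong, and how can I correct it? I need sklearn's implementation of KDE, so I cannot use another algorithm.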

Answer 1

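Kernel density estimation is called a non-parametric method, but it does in fact have one parameter: the bandwidth.

Every application of KDE needs this parameter to be set!

When you make your seaborn plot: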

sns.distplot(gaussian_mixture);

you do not provide a bandwidth, so seaborn falls back on a default heuristic (scott or silverman). These heuristics choose a bandwidth in a data-dependent way.

Your sklearn code looks like:
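To illustrate what "data-dependent" means here, the following is a minimal sketch of Scott's rule of thumb for 1-D data, written as the sample standard deviation times n^(-1/5). The helper name `scott_bandwidth` is hypothetical, and the exact formula seaborn applies internally may differ in detail; the point is only that such a heuristic shrinks along with the data, whereas a constant does not.

```python
import numpy as np

def scott_bandwidth(x):
    """Scott's rule of thumb for 1-D data: std * n**(-1/5) (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) * x.size ** (-1.0 / 5.0)

# Same kind of mixture as in the question, generated with a local RNG
rng = np.random.default_rng(0)
base = np.hstack([3 * rng.standard_normal(1700),
                  9 + 1.5 * rng.standard_normal(300)])

bw_1 = scott_bandwidth(base)          # divisor = 1
bw_100 = scott_bandwidth(base / 100)  # divisor = 100

# The heuristic bandwidth scales with the data, so bw_100 is bw_1 / 100
print(bw_1, bw_100)
```

Dividing the data by 100 divides the heuristic bandwidth by 100 as well, so a value that suited the original scale ends up far too wide for the rescaled data.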

kde_skl = KernelDensity(bandwidth=0.5)

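This uses a fixed, constant bandwidth! That may cause you trouble and could well be the reason here: with divisor = 100 the whole sample spans only a few tenths of a unit, so a bandwidth of 0.5 is wider than the entire data range. It is at least worth a look. In general, one would combine sklearn's KDE with GridSearchCV as a cross-validation tool to select a good bandwidth. In many cases this is slower than the heuristics above, but gives better results.

Unfortunately, you did not explain why you want to use sklearn's KDE. My personal ranking of the three popular candidates is statsmodels > sklearn > scipy.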
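A minimal sketch of that cross-validation approach, using the question's data with divisor = 100. The candidate grid (multiples of the sample standard deviation), the number of folds, and the shuffling step are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
divisor = 100
data = np.hstack([3 * rng.standard_normal(1700),
                  9 + 1.5 * rng.standard_normal(300)]) / divisor
# Shuffle so that each CV fold contains samples from both components
data = rng.permutation(data)
X = data[:, np.newaxis]

# Assumption: search bandwidths between 1% and 100% of the sample std
candidates = data.std() * np.logspace(-2, 0, 20)

# KernelDensity.score returns the total log-likelihood, which GridSearchCV maximizes
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": candidates}, cv=5)
search.fit(X)
best_bw = search.best_params_["bandwidth"]

# Evaluate the refitted estimator on a grid, as in the question
x_grid = np.linspace(data.min(), data.max(), 200)
pdf = np.exp(search.best_estimator_.score_samples(x_grid[:, np.newaxis]))
print(best_bw)
```

Because the candidate grid is expressed relative to the spread of the data, the same code should work unchanged for divisor = 1.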

