How to get kernels from kernel density estimation (preferably sklearn.neighbors)?

Question

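I am currently doing seasonality estimation for a time series dataset.

What I get is a set of candidate frequencies/periods that might occur in the dataset. This set is somewhat noisy (e.g. it contains periods such as [100, 98, 101, 102] that should actually be "the same").

To estimate sharp periods, I try to find the peaks via kernel density estimation (kde, sklearn.neighbors.KernelDensity) as follows: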

import numpy as np
from sklearn.neighbors import KernelDensity
from scipy import signal
import matplotlib.pyplot as plt

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(10, 13, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 11!

bw = 1

kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X.reshape(-1, 1))
estimator = np.linspace(0, 15, 100)
kde_est = np.exp(kde.score_samples(estimator.reshape(-1, 1)))

plt.plot(estimator, kde_est)

peaks_pos = signal.argrelextrema(kde_est, np.greater)[0]

print(estimator[peaks_pos])
# the peaks are at around 2 and 11!

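Alternatively, I would like to know what the kernels of this estimate look like. For the Gaussian case, there should be a set of μ and σ values available for all [default] 40 kernels. Can I access this information? I could not find a clue in the documentation or in the attributes of kde, but I am pretty sure it should be available somewhere.

To clarify why I need this:

In the following example, the 2 peaks are too close together to be found, but I am sure the kernels would show up.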

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(5, 8, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 6!

bw = 1

kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X.reshape(-1, 1))
estimator = np.linspace(0, 15, 100)
kde_est = np.exp(kde.score_samples(estimator.reshape(-1, 1)))

plt.plot(estimator, kde_est)

peaks_pos = signal.argrelextrema(kde_est, np.greater)[0]

print(estimator[peaks_pos])
# the peaks are at around 6 and sometimes 2!

Answer 1

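I believe that what you are looking for cannot be found in kernel density estimation. All kernels in a KDE have exactly the same shape (the same standard deviation, which for the Gaussian kernel is the bandwidth) and are centered on the data points (so the means are determined by the values in X).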
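To make this concrete, here is a minimal sketch that rebuilds the density by hand from one Gaussian per data point and compares it with what `KernelDensity` returns. It assumes the Gaussian kernel with the default exact evaluation (`atol=0`, `rtol=0`); the names `kernel_means`, `kernel_sigma`, `kernels` and `manual_est` are only illustrative and are not attributes of the estimator.

```python
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Same kind of sample as in the question: 20 values in 1..3, 200 values in 5..7.
X = np.concatenate((rng.integers(1, 4, 20), rng.integers(5, 8, 200))).astype(float)

bw = 1.0
kde = KernelDensity(kernel="gaussian", bandwidth=bw).fit(X.reshape(-1, 1))

grid = np.linspace(0, 15, 100)
kde_est = np.exp(kde.score_samples(grid.reshape(-1, 1)))

# The "kernels": one Gaussian per data point.
kernel_means = X   # mu_i is simply the i-th data point
kernel_sigma = bw  # sigma is the bandwidth, identical for every kernel

# Evaluate every kernel on the grid; resulting shape is (n_grid, n_samples).
kernels = stats.norm.pdf(grid[:, None], loc=kernel_means[None, :], scale=kernel_sigma)

# The KDE is the equally weighted average of all kernels.
manual_est = kernels.mean(axis=1)

print("largest absolute difference:", np.abs(kde_est - manual_est).max())
```

So there is one kernel per sample (220 here), not a fixed number of them, and the only free parameter is the shared bandwidth.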

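What you can do to keep the closeness of the normal distributions from blurring the peaks is to adjust the bandwidth (for your second sample, I managed to get 2 peaks quite consistently by using a bandwidth of 0.7). There are algebraic ways of choosing it (see: Wikipedia), or you can use cross-validation to pick the best bandwidth for your sample (see: blog on the subject).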
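A cross-validated bandwidth search could look roughly like the sketch below. It relies on `GridSearchCV` using `KernelDensity.score` (the total log-likelihood of the held-out fold) as the selection criterion. The sample is an assumption: real-valued periods drawn around 2 and 6 rather than the integers from the question, because with many tied integer values the held-out likelihood keeps improving as the bandwidth shrinks and the search tends to return the smallest value in the grid. The grid bounds are likewise arbitrary.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical real-valued period estimates around 2 and 6 (not the question's integers).
X = np.concatenate((rng.normal(2, 0.8, 20), rng.normal(6, 0.8, 200)))
rng.shuffle(X)  # the folds are contiguous slices, so mix the two groups first

# Candidate bandwidths; the range is an arbitrary choice for this sketch.
bandwidths = np.linspace(0.2, 2.0, 19)

search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": bandwidths},
    cv=5,
)
search.fit(X.reshape(-1, 1))

best_bw = search.best_params_["bandwidth"]
print("selected bandwidth:", best_bw)

# The refitted estimator can be used exactly like the kde object in the question.
kde = search.best_estimator_
```

The selected bandwidth maximizes held-out likelihood, which is not the same goal as separating two nearby peaks, so it is best treated as a starting point and checked against the plot.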

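If, however, you want to split your dataset into distinct components described by normal distributions of varying shape (weights, means and covariances), you probably want Gaussian mixture modeling. You can find an example below. To determine the optimal number of components there are various methods, such as the silhouette criterion or the Akaike information criterion (built into scikit-learn). Since we know there are 2 normal distributions in the example, I did not implement such a criterion, but you can easily find more information on the Internet.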
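As a rough illustration of the information-criterion approach, the sketch below fits mixtures with 1 to 5 components and compares them with `GaussianMixture.aic` and `GaussianMixture.bic`. The data are an assumption: real-valued samples around 2 and 6, since on integer-valued samples like those in the question components can collapse onto single values and the criteria may then favor more components than expected. The candidate range and the use of BIC for the final pick are also just choices made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical real-valued period estimates around 2 and 6 (not the question's integers).
X = np.concatenate((rng.normal(2, 0.8, 20), rng.normal(6, 0.8, 200))).reshape(-1, 1)

candidates = range(1, 6)  # numbers of components to try
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in candidates]

# Lower is better for both criteria; BIC penalizes extra components more strongly.
aic = [m.aic(X) for m in models]
bic = [m.bic(X) for m in models]

for k, a, b in zip(candidates, aic, bic):
    print(f"k={k}  AIC={a:.1f}  BIC={b:.1f}")

best_k = candidates[int(np.argmin(bic))]
best_model = models[int(np.argmin(bic))]
print("components selected by BIC:", best_k)
print("means of the selected model:", best_model.means_.ravel())
```

With a small minority group (20 of 220 points here) the criteria may or may not resolve the second component, so it is worth looking at the printed table rather than only the selected value.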

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal, stats
from sklearn.mixture import GaussianMixture

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(5, 8, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 6!

components = 2

gmm = GaussianMixture(n_components=components).fit(X.reshape(-1, 1))

# you can now directly get the means from the gaussian mixture model's components,
# skipping the score_samples and signal.argrelextrema steps.
print(gmm.means_)
# the means are around 2 and 6!


# your original method of getting the peaks:
estimator = np.linspace(0, 15, 100)
gmm_est = np.exp(gmm.score_samples(estimator.reshape(-1, 1)))

# density=True replaces the removed normed=True argument
plt.hist(X, density=True)
plt.plot(estimator, gmm_est, linewidth=5, color='black', alpha=0.7)


peaks_pos = signal.argrelextrema(gmm_est, np.greater)[0]

print(estimator[peaks_pos])


# plotting the separate components (weight * normal pdf of each component):
for n, weight in enumerate(gmm.weights_):
    plt.plot(estimator, weight * stats.norm.pdf(estimator, gmm.means_[n][0], np.sqrt(gmm.covariances_[n][0][0])))
plt.show()

image of results
