如何从经验 cdf 计算和绘制 pdf？

Question

我有两个 numpy 数组，一个是 x 值数组，另一个是 y 值数组，它们一起给了我经验 cdf。例如：

plt.plot(xvalues, yvalues)
plt.show()

我假设需要以某种方式平滑数据以提供平滑的 pdf。

我想绘制pdf。我该怎么做？

原始数据位于：http://dpaste.com/1HVK5DR .

Answer 1

您需要导数才能将 CDF 转换为 PDF

PDF(x) = d CDF(x)/ dx

对于 NumPy，您可以使用 gradient

pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()

或手动微分

pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()

两者都产生类似的东西，更新的图片不知何故搞砸了

Answer 2

有两个主要问题：您的数据似乎很嘈杂，而且分布不均：低端的点采样非常密集，而高端的点采样非常稀疏。这可能会导致数值问题。

所以首先我建议使用线性插值对数据进行重采样以获得等间隔的样本：（请注意，附加到彼此的所有片段构成 one python 文件。）

import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file


print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]

#interpolate to have evenly space data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)

然后，为了平滑数据，我建议执行 RBF 回归（=使用 "RBF Network"）。这个想法是拟合形式的曲线

 c(t) = sum a(i) * phi(t - x(i))    #(not part of the program)

其中 phi 是一些径向基函数。（理论上我们可以使用 any 函数。）为了获得非常平滑的结果，我选择了一个非常平滑的函数，即高斯函数：phi(x) = exp( - x^2/sigma^2) 其中 sigma 是待定。 x(i) 只是我们可以定义的一些节点。如果我们有一个平滑的函数，我们只需要几个节点。节点的数量也决定了需要完成多少计算。 a(i) 是我们可以优化以获得最佳拟合的系数。在这种情况下，我只使用最小二乘法。

注意，如果我们可以写成上面形式的函数，计算导数就很容易了，就是

 c(t) = sum a(i) * phi'(t - x(i))

其中 phi' 是 phi 的导数。 #（不是程序的一部分）

关于sigma：选择它作为我们选择的节点之间的步长的倍数通常是个好主意。我们选择 sigma 越大，结果函数越平滑。

#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20  #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))


# kernel & derivative
rbf = lambda r:1/(1+(r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2

#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)

#evaluate rbf network
N=1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae @ coeffs


#evaluate derivative
N=1000
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xe[:, None] - rbf_nodes)
yd = Bd @ coeffs


fig,ax = plt.subplots()
ax2 = ax.twinx()

ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')

ax2.plot(xd, yd, '-')

fig.savefig('graph.png')
print('done')

如何从经验 cdf 计算和绘制 pdf？

How to compute and plot the pdf from the empirical cdf?

python

statistics

numpy

probability

matplotlib