Plotting data points on where they fall in a distribution

Question

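Say I have a large data set that I can manipulate in some kind of analysis, for example by looking at values in a probability distribution.

Now that I have this large data set, I want to compare known, actual data against it: primarily, how many of the values in my data set share the same value or property as the known data. For example:

This is a cumulative distribution. The continuous lines come from data generated by simulations, and the decreasing intensities are just predicted percentages. The stars are observational (known) data, plotted against the generated data.

Another example I made shows how the points could be projected visually onto a histogram:

I'm having difficulty marking where the known data points fall within the generated data set and plotting that cumulatively alongside the distribution of the generated data.

If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):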

def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)

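But I'm having trouble accounting for the points that fall within the range of values and then setting it all up so that I can plot it. Any ideas on how to gather this data and project it onto a cumulative distribution?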

Answer 1

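The question is rather chaotic: it contains a lot of irrelevant information but stays vague on the essential points. I will try to interpret it as best I can.

I think what you are after is the following: given a finite sample from an unknown distribution, what is the probability of obtaining a new sample at a fixed value?

I'm not sure there is a general answer to that; in any case it would be a question for statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.

For the practical case, however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.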

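So assume we have a distribution x, which we divide into bins. We can compute the histogram h using numpy.histogram. The probability of finding a value in each bin is then given by h/h.sum().
Given a value v=0.77 whose probability according to the distribution we want to know, we can find the bin it belongs to by looking up the index ind at which the value would have to be inserted into the array of bin edges for that array to stay sorted. With side="right" that index points at the bin's right edge, so the bin itself is ind-1. This can be done using numpy.searchsorted.

import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
# fraction of the sample that falls into each bin
prob = h/float(h.sum())

# ind is the index of the first bin edge greater than 0.77 (the right edge of its bin)
ind = np.searchsorted(bins, 0.77, side="right")
# the bin that contains 0.77 is therefore ind-1
print(prob[ind-1])

So the probability of sampling a value in the bin around 0.77 is roughly 6% (0.058 in the output recorded with this answer).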
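The same lookup can be wrapped up so that it handles several values at once. What follows is only a sketch under stated assumptions: the helper name bin_probability and the choice to return 0 for values outside the binned range are illustrative and not part of the code above. It also draws the sample with NumPy's Generator API, so its numbers will not match the seeded example.

```python
import numpy as np

def bin_probability(sample, values, bins):
    """For each value, return the fraction of `sample` that shares its histogram bin.

    Hypothetical helper: values outside [bins[0], bins[-1]] get probability 0.
    """
    h, edges = np.histogram(sample, bins=bins)
    prob = h / h.sum()
    values = np.atleast_1d(np.asarray(values, dtype=float))
    # searchsorted(side="right") returns the index of the first edge greater
    # than the value, so the containing bin is one position to the left
    idx = np.searchsorted(edges, values, side="right") - 1
    # np.histogram treats the last bin as closed on the right
    idx[values == edges[-1]] = len(prob) - 1
    inside = (values >= edges[0]) & (values <= edges[-1])
    out = np.zeros(len(values))
    out[inside] = prob[idx[inside]]
    return out

rng = np.random.default_rng(0)
sample = rng.rayleigh(size=1000)
bins = np.linspace(0, 4, 41)
print(bin_probability(sample, [0.77, 1.13, 2.15, 5.0], bins))
```

The last value, 5.0, lies beyond the final bin edge, so under the assumption above it is reported as 0 instead of raising an IndexError.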

Another option would be to interpolate the histogram between the bin centers to find the probability.
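A minimal sketch of that interpolation, assuming linear interpolation with numpy.interp is acceptable. The function name interp_bin_probability is hypothetical, and the result is still a per-bin probability (for bins of the given width), not a probability density.

```python
import numpy as np

def interp_bin_probability(sample, values, bins):
    """Linearly interpolate the per-bin probabilities between bin centers."""
    h, edges = np.histogram(sample, bins=bins)
    prob = h / h.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    # beyond the outermost centers np.interp holds the first/last bin's value
    return np.interp(values, centers, prob)

rng = np.random.default_rng(0)
sample = rng.rayleigh(size=1000)
bins = np.linspace(0, 4, 41)
print(interp_bin_probability(sample, [0.77, 1.13, 2.15], bins))
```

Compared with the plain bin lookup, the interpolated value changes smoothly as the point moves across a bin instead of jumping at the bin edges.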

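In the code below we plot a distribution similar to the one in the picture from the question and use both methods: the first for the frequency histogram, the second for the cumulative distribution.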

import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt

x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
# cumulative histogram, normalized so that it ends at 1
hcum = np.cumsum(h)/float(np.cumsum(h).max())

points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
# mathtext markers; Python 3 strings are unicode, so no ur'' prefix is needed
markers = ['$\u2660$','$\u2665$','$\u263B$']
colors = ["k", "crimson" , "gold"]
labels = list("ABC")

kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)

# x positions of the cumulative curve: first edge, bin centers, last edge
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
# matching y values: 0 at the start, hcum at the centers, 1 at the end
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)

for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    # plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()
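For the count the question starts from (how many simulated values lie within an uncertainty of each observed value), one possible sketch is given here. The function names count_within and ecdf_at are hypothetical, and treating the window as the open interval (obs - uncert, obs + uncert) is an assumption taken from the strict inequalities in the question's SameValue.

```python
import numpy as np

def count_within(sim, obs, uncert):
    """For each observed value, count simulated values in (obs - uncert, obs + uncert)."""
    sim = np.sort(np.asarray(sim, dtype=float))
    obs = np.atleast_1d(np.asarray(obs, dtype=float))
    # number of simulated values <= obs - uncert
    lo = np.searchsorted(sim, obs - uncert, side="right")
    # number of simulated values < obs + uncert
    hi = np.searchsorted(sim, obs + uncert, side="left")
    # clip at 0 so an empty window (uncert == 0) never gives a negative count
    return np.maximum(hi - lo, 0)

def ecdf_at(sim, obs):
    """Fraction of simulated values less than or equal to each observed value."""
    sim = np.sort(np.asarray(sim, dtype=float))
    obs = np.atleast_1d(np.asarray(obs, dtype=float))
    return np.searchsorted(sim, obs, side="right") / sim.size

rng = np.random.default_rng(0)
sim = rng.rayleigh(size=1000)
obs = np.array([0.77, 1.13, 2.15])
print(count_within(sim, obs, 0.05))
print(ecdf_at(sim, obs))
```

Plotting obs against ecdf_at(sim, obs) would place each observed point on the empirical cumulative curve of the simulated data, without any binning.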

python

arrays

numpy

distribution

matplotlib