Can't get y-axis on Matplotlib histogram to display probabilities

Question

My data (a pandas Series) looks like this (daily stock returns, n = 555):

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I want to generate a probability distribution plot from it, using:

print(stats.normaltest(S))

# density=True replaces the removed normed=1 argument
n, bins, patches = plt.hist(S, 100, density=True, facecolor='blue', alpha=0.75)
print(np.sum(n * np.diff(bins)))

(mu, sigma) = stats.norm.fit(S)
print(mu, sigma)
# stats.norm.pdf replaces mlab.normpdf, which newer Matplotlib no longer ships
y = stats.norm.pdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05,0.05)
plt.show()

I get the following output:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

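I am under the impression that the y-axis is a count, but I would like it to show probabilities instead. How can I do that? I have tried many Stack Overflow answers but cannot figure it out.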

Answer 1

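As far as I know, there is no simple way to do this with plt.hist. However, you can bin the data with np.histogram and then normalize it any way you like. If I understand correctly, you want the plot to show the probability of finding a point in a given bin, rather than the probability density. That means you have to scale the data so that the sum over all bins is 1. This can be done simply with bin_probability = n/float(n.sum()).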

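You will then no longer have a properly normalized probability density function (pdf), which means that the integral over an interval is no longer a probability! That is why you have to rescale the fitted normal pdf so that it has the same norm as the histogram. The required factor is just the bin width: for a properly normalized binned pdf, the sum over all bins of height times width is 1, whereas you now want the bin heights themselves to sum to 1. So the scaling factor is the bin width.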
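The relationship between the two normalizations can be checked numerically. The following is a minimal sketch, assuming synthetic normally distributed data in place of the real returns; the seed, sample size, and variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the returns series (an assumption, not the asker's data)
rng = np.random.default_rng(0)
S = rng.normal(0, 0.01, size=1000)

# Raw counts and density-normalized heights over the same 100 equal-width bins
counts, bin_edges = np.histogram(S, bins=100)
density, _ = np.histogram(S, bins=bin_edges, density=True)
bin_width = np.diff(bin_edges)

# Per-bin probability: the fraction of all points that fall into each bin
bin_probability = counts / counts.sum()

# The density integrates to 1, while the probabilities sum to 1
print(np.sum(density * bin_width))
print(bin_probability.sum())

# Multiplying the density by the bin width recovers the per-bin probability
print(np.allclose(density * bin_width, bin_probability))
```

The last check is the reason a fitted pdf has to be multiplied by the bin width before it is drawn over a probability-scaled histogram.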

So you end up with code roughly like this:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Produce test data
S = np.random.normal(0, 0.01, size=1000)

# Histogram:
# Bin it
n, bin_edges = np.histogram(S, 100)
# Normalize it, so that each bin's value gives the probability of that bin
bin_probability = n/float(n.sum())
# Get the midpoints of every bin
bin_middles = (bin_edges[1:]+bin_edges[:-1])/2.
# Compute the bin width (all bins are equally wide)
bin_width = bin_edges[1]-bin_edges[0]
# Plot the histogram as a bar plot
plt.bar(bin_middles, bin_probability, width=bin_width)

# Fit to normal distribution
(mu, sigma) = stats.norm.fit(S)
# The pdf should no longer be normalized but scaled the same way as the data
# (stats.norm.pdf is used because mlab.normpdf is gone from newer Matplotlib)
y = stats.norm.pdf(bin_middles, mu, sigma)*bin_width
l = plt.plot(bin_middles, y, 'r', linewidth=2)

plt.grid(True)
plt.xlim(-0.05,0.05)
plt.show()

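The resulting plot is a bar histogram whose bar heights sum to 1, with the rescaled normal curve drawn over it.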

Answer 2

jotasi's answer certainly works, but I would like to add a very simple trick that achieves this with a direct call to hist.

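The trick is to use the weights argument. By default, every data point you pass has a weight of 1, and the height of each bin is the sum of the weights of the data points that fall into it. If instead we have n points, we can simply give each point a weight of 1 / n. The sum of the weights of the points that fall into a bin is then also the probability that a given point lies in that bin.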

In your case, just change the plotting line to:

n, bins, patches = plt.hist(S, weights=np.ones_like(S) / len(S),
                            facecolor='blue', alpha=0.75)
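As a quick sanity check of the weights approach, the sketch below uses np.histogram, which accepts the same weights argument, so it runs without a display. The synthetic data, seed, and variable names are assumptions for illustration and not the asker's series.

```python
import numpy as np

# Synthetic stand-in for the 555 daily returns (an assumption)
rng = np.random.default_rng(1)
S = rng.normal(0, 0.01, size=555)

# Give every point a weight of 1 / n, as in the hist call above
weights = np.ones_like(S) / len(S)
heights, bin_edges = np.histogram(S, bins=100, weights=weights)

# Plain counts over the same bins, for comparison
counts, _ = np.histogram(S, bins=bin_edges)

# The weighted heights sum to 1 and equal count / n in every bin
print(heights.sum())
print(np.allclose(heights, counts / len(S)))
```

If a fitted normal curve is drawn over a histogram built this way, it still has to be multiplied by the bin width, for the reason given in the first answer.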
