Central Limit Theorem: Sample means do not follow a normal distribution

Question

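Good evening.

I am studying the Central Limit Theorem. As an exercise, I ran a simulation to try to find the mean of a fair die (I know, a toy problem).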

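I took 4000 samples, and in each sample I rolled the die 50 times (the code is at the bottom). For each of these 4000 samples I computed the mean. I then plotted the 4000 sample means in a histogram with matplotlib (bin size 0.03).

Here is the result: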

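Question

Given that the CLT condition (sample size >= 30) is met, why are the sample means not normally distributed?

Specifically, why does the histogram look like two normal distributions superimposed on each other? More interestingly, why does the "outer" distribution look "discrete", with empty gaps appearing at regular intervals?

The results seem to be off in a systematic way.

Any help is greatly appreciated. I am very lost.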

Supplementary code

The code I used to generate the 4000 samples.

"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.

With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.

"""

import random

sample_means = []

num_samples = 4000

for i in range(num_samples):
    # Large enough for CLT to hold
    num_rolls = 50
    
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

Answer 1

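When num_rolls equals 50, every possible mean is a fraction with denominator 50, that is, a multiple of 0.02. So what you are seeing is in fact a discrete distribution.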
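As a quick check of that claim, here is a minimal sketch that enumerates every mean 50 dice can produce. The names `possible_means` and `gaps` are only illustrative and do not come from the question; exact fractions are used so that floating-point noise cannot blur the result.

```python
from fractions import Fraction

num_rolls = 50

# The sum of 50 dice is an integer between 50 (all ones) and 300 (all sixes),
# so every attainable mean is that integer divided by 50.
possible_means = [
    Fraction(total, num_rolls)
    for total in range(num_rolls, 6 * num_rolls + 1)
]

# Distances between consecutive attainable means.
gaps = {b - a for a, b in zip(possible_means, possible_means[1:])}

print(len(possible_means))  # number of distinct attainable means
print(gaps)                 # a single gap size, 1/50 = 0.02
```

Only 251 distinct means exist between 1 and 6, all spaced 0.02 apart, so a histogram whose bins are narrower than or misaligned with that spacing will show the grid.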

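To create a histogram of a discrete distribution, the bin boundaries are best placed between the values. The possible means are spaced 0.02 apart, so with a step of 0.03 some bins contain two possible values while their neighbors contain only one, and some bin boundaries coincide exactly with values. That is why every other bar is roughly twice as tall, which looks like two distributions on top of each other. In addition, because of subtle floating-point rounding issues, the result can become unpredictable when a value and a boundary coincide.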

Here is some code to illustrate what is happening:

from matplotlib import pyplot as plt
import numpy as np
import random

sample_means = []
num_samples = 4000

for i in range(num_samples):
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)

    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

fig, axs = plt.subplots(2, 2, figsize=(14, 8))

random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
    bins = np.arange(3.01, 4, step)
    ax0.hist(sample_means, bins=bins)
    ax0.set_title(f'step={step}')
    ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r')  # show the bin boundaries in red
    ax1.scatter(sample_means, random_y, s=1)  # show the sample means with a random y
    ax1.vlines(bins, 0, 1, ls=':', color='r')  # show the bin boundaries in red
    ax1.set_xticks(np.arange(3, 4, 0.02))
    ax1.set_xlim(3.0, 3.3)  # zoom in on a region to better see the bins
    ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()
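The uneven bin contents can also be checked without plotting. The following is a rough sketch, not part of the original answer: the helper `values_per_bin` is hypothetical, it works in integer hundredths (301 stands for 3.01) to sidestep floating-point rounding, and it assumes half-open bins `[lo, hi)`, which is how numpy treats every bin except the last.

```python
def values_per_bin(step_hundredths, start=301, stop=400):
    """Count attainable means (multiples of 0.02) in each half-open bin.

    All quantities are integer hundredths, so 301 means 3.01 and a step
    of 3 means 0.03. No floating-point arithmetic is involved.
    """
    edges = list(range(start, stop, step_hundredths))
    counts = []
    for lo, hi in zip(edges, edges[1:]):
        # Attainable means are the even numbers of hundredths (3.02, 3.04, and so on).
        counts.append(sum(1 for v in range(lo, hi) if v % 2 == 0))
    return counts

print(values_per_bin(3)[:8])  # step 0.03: bins alternate between 1 and 2 values
print(values_per_bin(2)[:8])  # step 0.02: every bin holds exactly 1 value
```

With a step of 0.03 the counts alternate between one and two attainable values per bin, while a step of 0.02 gives exactly one value per bin.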

PS: Note that the code would run much faster if, instead of using Python lists, it worked entirely with numpy.
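One possible numpy-only version of the sampling loop is sketched below. It assumes numpy's `Generator` API (`np.random.default_rng`), which the original code does not use; the array name `rolls` is illustrative.

```python
import numpy as np

rng = np.random.default_rng()  # pass a seed, e.g. default_rng(0), for reproducible runs

num_samples = 4000
num_rolls = 50

# One row per sample, one column per roll. integers() excludes the upper
# bound, so high=7 yields faces 1 through 6.
rolls = rng.integers(low=1, high=7, size=(num_samples, num_rolls))

# The mean of each row gives all 4000 sample means in a single call.
sample_means = rolls.mean(axis=1)

print(sample_means.shape)
print(sample_means.mean())  # should be close to 3.5
```

The resulting `sample_means` array can be passed to `ax.hist` exactly like the list in the code above.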


simulation

statistics

matplotlib

jupyter-notebook
