A mathematical explanation for why the variance of bootstrap estimates decreases

Question

I'm trying to grok bootstrapping and bagging (bootstrap aggregation), so I've been running some experiments. I loaded a sample dataset from Kaggle and tried the bootstrap method:

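import numpy as np
import pandas as pd

X = pd.read_csv("dataset.csv")
true_median = np.median(X["Impressions"])
B = 500
errors = []
variances = []
# For each b, draw b bootstrap resamples and record the median of each one.
for b in range(1, B):
    sample_medians = [np.median(X.sample(len(X), replace=True)["Impressions"]) for _ in range(b)]
    # Error: mean of the b bootstrap medians minus the full-sample median.
    error = np.mean(sample_medians) - true_median
    # Variance of the b bootstrap medians (np.std uses ddof=0 by default).
    variances.append(np.std(sample_medians) ** 2)
    errors.append(error)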

Then I visualized the errors and variances:

import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('Number of Bootstrap Samples (B)')
ax1.set_ylabel('Bootstrap Estimate Error', color=color)
ax1.plot(errors, color=color, alpha=0.7)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()

color = 'tab:blue'
ax2.set_ylabel('Bootstrap Estimate Variance', color=color)
ax2.plot(variances, color=color, alpha=0.7)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.title("Relationship Between Bootstrap Error, Variance \nand Number of Bootstrap Iterations")
plt.show()

Here is the output of the plot:

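You can see that both the error and the variance decrease as B increases. I'm trying to find some kind of mathematical justification: is there a way to derive or prove why the variance of the bootstrap estimates decreases as B increases?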

Answer 1

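I think what you're seeing is the Central Limit Theorem in action. When the loop starts, only a few samples have been drawn with replacement from the population, so the mean of the medians (whose offset from the true median is what you call the error) is not yet representative of the true population median. As you generate more samples, the mean of the medians gradually converges toward the true median. Because of that convergence, the samples from this distribution do not lie far enough apart to produce a large variance, so the variance settles down as well.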
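One way to make the convergence argument above more precise is sketched below. It assumes the resamples are drawn independently given the data; the symbols $\hat\theta^*_b$, $\mu^*$ and $\sigma^{*2}$ are notation introduced here for illustration and do not appear in the question.

```latex
% Bootstrap medians \hat\theta^*_1, \dots, \hat\theta^*_B are i.i.d. given the data X,
% with conditional mean \mu^* and conditional variance \sigma^{*2}.
\bar\theta^*_B = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b,
\qquad
\mathbb{E}\!\left[\bar\theta^*_B \mid X\right] = \mu^*,
\qquad
\operatorname{Var}\!\left(\bar\theta^*_B \mid X\right) = \frac{\sigma^{*2}}{B} \xrightarrow[B\to\infty]{} 0 .

% The sample variance of the B medians does not go to zero; it converges to \sigma^{*2}:
s_B^2 = \frac{1}{B}\sum_{b=1}^{B}\left(\hat\theta^*_b - \bar\theta^*_B\right)^2
\xrightarrow[B\to\infty]{} \sigma^{*2}.
```

Under these assumptions, the quantity that shrinks like $1/B$ is the Monte Carlo variance of the averaged estimate, which is why the error curve becomes less noisy. The plotted `variances` should instead stabilize near $\sigma^{*2}$, and the error should settle near $\mu^* - \hat\theta$, where $\hat\theta$ is the full-sample median, which need not be exactly zero.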

Does that clarify things? If not, please elaborate on what you expected to see when plotting them, and we can discuss how to get there.
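A small simulation can help check this kind of convergence empirically. The sketch below is not the code from the question: it uses synthetic log-normal data as a stand-in for the "Impressions" column (an assumption, since the Kaggle file is not available here), and the helper names `bootstrap_medians` and `variance_of_mean` are hypothetical. It compares the variance of the mean of B bootstrap medians with the spread of the individual medians divided by B.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the "Impressions" column (assumption, not the Kaggle data).
data = rng.lognormal(mean=3.0, sigma=1.0, size=200)

def bootstrap_medians(values, n_resamples, rng):
    """Return the medians of n_resamples bootstrap resamples of values."""
    # Each row of idx holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    return np.median(values[idx], axis=1)

def variance_of_mean(values, B, rng, repeats=300):
    """Estimate Var(mean of B bootstrap medians) by repeating the experiment."""
    means = [bootstrap_medians(values, B, rng).mean() for _ in range(repeats)]
    return np.var(means, ddof=1)

# Spread of the individual bootstrap medians: settles to a constant, not to zero.
sigma2_star = np.var(bootstrap_medians(data, 5000, rng), ddof=1)

# Variance of the averaged estimate for a few values of B.
results = {B: variance_of_mean(data, B, rng) for B in (10, 40, 160)}
for B, v in results.items():
    print(f"B={B:4d}  Var(mean)={v:.4f}  sigma*^2/B={sigma2_star / B:.4f}")
```

If the two printed columns track each other, that is consistent with the averaged estimate's variance shrinking roughly like 1/B while the variance of the individual bootstrap medians stays about the same.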

python

numpy

machine-learning

sampling

data-science