How can I calculate the kurtosis of already binned data?

Question

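Does anyone know how to calculate the kurtosis of a distribution from binned data alone, using Python?

I have a histogram of the distribution but not the raw data. There are two columns: one with the bin value and one with the count. I need to calculate the kurtosis of the distribution.

If I had the raw data, I could use the scipy function to calculate the kurtosis. I can't see anything in its documentation about calculating it from binned data. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html

The binned statistic option in scipy lets you calculate the kurtosis within each bin, but only from raw data and only within the bins. https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.binned_statistic.html

Edit: example data below. I could try resampling from it to create my own dummy raw data, but I have about 140k of these to run per day and was hoping for something built in.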

Index,Bin,Count
 0, 730, 30
 1, 735, 45
 2, 740, 41
 3, 745, 62
 4, 750, 80
 5, 755, 96
 6, 760, 94
 7, 765, 90
 8, 770, 103
 9, 775, 96
10, 780, 95
11, 785, 109
12, 790, 102
13, 795, 99
14, 800, 93
15, 805, 101
16, 810, 109
17, 815, 98
18, 820, 89
19, 825, 62
20, 830, 71
21, 835, 69
22, 840, 58
23, 845, 50
24, 850, 42

Answer 1

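You can compute the statistics directly. If x holds your bin values and y holds the count for each bin, then the expected value of f(x) equals np.sum(y*f(x))/np.sum(y). We can use this to turn the formula for kurtosis into the following code: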

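import numpy as np  # x: bin values, y: counts (NumPy arrays of equal length)
total = np.sum(y)  # total number of observations
mean = np.sum(y * x) / total  # count-weighted mean
variance = np.sum(y * (x - mean)**2) / total  # population (biased) variance
kurtosis = np.sum(y * (x - mean)**4) / (variance**2 * total)  # fourth central moment / variance squared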
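Written as formulas, the code above computes count-weighted moments. The index i runs over the bins; the symbols \mu and \sigma^2 are notation introduced here for the mean and variance variables, not part of the original answer.

```latex
\mu = \frac{\sum_i y_i x_i}{\sum_i y_i}, \qquad
\sigma^2 = \frac{\sum_i y_i (x_i - \mu)^2}{\sum_i y_i}, \qquad
\text{kurtosis} = \frac{\sum_i y_i (x_i - \mu)^4}{\sigma^4 \sum_i y_i}
```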

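Note that kurtosis and excess kurtosis are not the same thing. The code above gives the plain (Pearson) kurtosis, which is 3 for a normal distribution; scipy.stats.kurtosis returns the excess kurtosis by default (fisher=True), which is 3 lower.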
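As a minimal sketch, the answer's formulas can be wrapped in a function and run on the sample data from the question. The function name binned_kurtosis and its excess flag are hypothetical, introduced here for illustration. The sketch assumes each bin value can stand in for every observation in that bin, so any spread inside a bin is ignored.

```python
import numpy as np


def binned_kurtosis(bins, counts, excess=False):
    """Kurtosis of a histogram, treating each bin value as a point mass.

    Hypothetical helper: `bins` are the bin values, `counts` the counts.
    """
    x = np.asarray(bins, dtype=float)
    y = np.asarray(counts, dtype=float)
    total = y.sum()
    mean = (y * x).sum() / total
    variance = (y * (x - mean) ** 2).sum() / total  # population variance
    kurt = (y * (x - mean) ** 4).sum() / (variance ** 2 * total)
    # Subtract 3 only when the excess (Fisher) definition is requested.
    return kurt - 3.0 if excess else kurt


# Sample data from the question: bin values 730, 735, ..., 850 and counts.
bins = np.arange(730, 855, 5)
counts = np.array([30, 45, 41, 62, 80, 96, 94, 90, 103, 96, 95, 109, 102,
                   99, 93, 101, 109, 98, 89, 62, 71, 69, 58, 50, 42])

# Sanity check: expand the histogram into pseudo-raw data and compute the
# same quantity the long way (np.var uses the population variance by default).
expanded = np.repeat(bins, counts).astype(float)
raw = np.mean((expanded - expanded.mean()) ** 4) / expanded.var() ** 2

print(binned_kurtosis(bins, counts), raw)
```

The two printed values should agree up to floating-point error, and scipy.stats.kurtosis(np.repeat(bins, counts), fisher=False) should give the same number. The weighted version avoids building the expanded array, which may matter when running roughly 140k histograms per day.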
