如何防止大数据的 MLE 方法溢出

How to prevent overflow in MLE method for large data

我正在尝试使用 scipy 进行手动 MLE 估计。我的数据集不是那么大,所以令我惊讶的是我的值变得非常快非常快并且 scipy.optimize.minimize 似乎非常快地进入我的密度的 NaN。我尝试使用对数之和而不是密度的乘积,但这并没有使事情变得更好。

这是我的数据:

[
        1.000, 1.000, 1.000, 1.004, 1.005, 1.008, 1.014, 1.015, 1.023, 1.035, 1.038,
        1.046, 1.048, 1.050, 1.050, 1.052, 1.052, 1.057, 1.063, 1.070, 1.070, 1.076,
        1.087, 1.090, 1.091, 1.096, 1.101, 1.102, 1.113, 1.114, 1.120, 1.130, 1.131,
        1.150, 1.152, 1.154, 1.155, 1.162, 1.170, 1.177, 1.189, 1.191, 1.193, 1.200,
        1.200, 1.200, 1.200, 1.205, 1.210, 1.218, 1.238, 1.238, 1.241, 1.250, 1.250,
        1.256, 1.257, 1.272, 1.278, 1.289, 1.299, 1.300, 1.316, 1.331, 1.349, 1.374,
        1.378, 1.382, 1.396, 1.426, 1.429, 1.439, 1.443, 1.446, 1.473, 1.475, 1.478,
        1.499, 1.506, 1.559, 1.568, 1.594, 1.609, 1.626, 1.649, 1.650, 1.669, 1.675,
        1.687, 1.715, 1.720, 1.735, 1.750, 1.755, 1.787, 1.797, 1.805, 1.898, 1.908,
        1.940, 1.989, 2.010, 2.012, 2.024, 2.047, 2.081, 2.085, 2.097, 2.136, 2.178,
        2.181, 2.193, 2.200, 2.220, 2.301, 2.354, 2.359, 2.382, 2.409, 2.418, 2.430,
        2.477, 2.500, 2.534, 2.572, 2.588, 2.591, 2.599, 2.660, 2.700, 2.700, 2.744,
        2.845, 2.911, 2.952, 3.006, 3.021, 3.048, 3.059, 3.092, 3.152, 3.276, 3.289,
        3.440, 3.447, 3.498, 3.705, 3.870, 3.896, 3.969, 4.000, 4.009, 4.196, 4.202,
        4.311, 4.467, 4.490, 4.601, 4.697, 5.100, 5.120, 5.136, 5.141, 5.165, 5.260,
        5.329, 5.778, 5.794, 6.285, 6.460, 6.917, 7.295, 7.701, 8.032, 8.142, 8.864,
        9.263, 9.359, 10.801, 11.037, 11.504, 11.933, 11.998, 12.000, 14.153, 15.000,
        15.398, 19.793, 23.150, 27.769, 28.288, 34.325, 42.691, 62.037, 77.839
]

如何在没有 运行 过低问题的情况下执行 MLE 算法?

如果您想知道,我正在尝试拟合函数 f(x) = alpha * 500000^(alpha) / x^(alpha+1)。所以我有的是

import numpy as np
from scipy.optimize import minimize

# data = the given dataset
log_pareto_pdf = lambda alpha, x: np.log(alpha * (5e5 ** alpha) / (x ** (alpha + 1)))
ret = minimize(lambda alpha: -np.sum([log_pareto_pdf(alpha, x) for x in data]), x0=np.array([1]))

你需要避免巨大的求幂。一种方法是实际简化您的函数:

log_pareto_pdf = lambda alpha, x: np.log(alpha) + alpha*np.log(5e5) - (alpha + 1)*np.log(x)

如果不简化,您的程序仍然需要尝试计算 5e5**alpha 项,它会变得非常快(溢出 55 左右)。

您还需要向 minimize 提供 bounds 参数,以防止 log 中出现任何负数。