Scipy：仅从尾部采样对数正态分布

Question

我正在构建一个模拟，它需要从对数正态分布的尾部随机抽取。选择阈值 τ (tau)，得到的条件分布由下式给出：

我需要从该条件分布中随机抽样，其中 F(x) 是具有所选 µ (mu) 和 σ (sigma) 的对数正态分布，而 τ (tau) 由用户设置。

我现在不优雅的解决方案只是简单地从对数正态分布中采样，丢弃 τ (tau) 下的任何值，直到我获得所需的样本量。但我相信这可以改进。

感谢您的帮助！

Answer 1

一个干净的方法是定义一个 rv_continuous 的子类并实现 _cdf。要绘制变量，您可能还需要定义 _ppf 或 _rvs 方法。

Answer 2

最简单的方法可能是利用 Scipy 提供的截断正态分布。

这给出了以下代码，其中 ν (nu) 作为标准高斯分布的变量，并且 τ (tau) 映射到该分布上的 ν₀。此函数 returns 包含 ranCount 对数正态变量的 Numpy 数组：

import  numpy  as  np
from scipy.stats import truncnorm

def getMySamplesScipy(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma              # position of tau on unit Gaussian
    xs = truncnorm.rvs(nu0, np.inf, size=ranCount)  # truncated unit normal samples
    ys = np.exp(mu + sigma * xs)                    # go back to x space
    return ys

如果由于某种原因这不合适，那么一些常用于高斯变量的技巧，例如 Box-Muller do not work for a truncated distribution, but we can resort always to a general principle: the Inverse Transform Sampling 定理。

因此，我们通过转换统一变量来为我们的变量生成累积概率。我们相信 Scipy，使用它的反函数 erf 误差函数从我们的概率返回到 x space 值。

这给出了类似下面的 Python 代码（没有任何优化尝试）：

import  math
import  random
import  numpy  as  np
import  numpy.random  as  nprd
import  scipy.special as  spfn

# using the "Inverse Method":
def getMySamples(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma      # position of tau in standard Gaussian curve
    headCP = (1/2) * (1 + spfn.erf(nu0/math.sqrt(2)))
    tailCP = 1.0 - headCP                   # probability of being in the "tail"

    uvs = np.random.uniform(0.0, 1.0, ranCount)  # uniform variates
    cps = (headCP + uvs * tailCP)                # Cumulative ProbabilitieS
    nus = (math.sqrt(2)) * spfn.erfinv(2*cps-1)  # positions in standard Gaussian
    xs  = np.exp(mu + sigma * nus)               # go back to x space
    return xs

备选方案：

我们可以利用与 Truncated Gaussian distribution.

相关的大量 material

有一个相对较新的 (2016) review paper on the subject by Zdravko Botev and Pierre L'Ecuyer. This paper provides a pointer to publicly available R source code. Some material is seriously old, for example the 1986 book by Luc Devroye: Non-Uniform Random Variate Generation.

例如，一种可能的基于拒绝的方法：如果τ(tau)映射到标准高斯曲线上的ν₀，则单位高斯分布就像exp(-ν ²/2)。如果我们写 ν = ν₀ + δ，这与以下成正比：exp(-δ²/2) * exp(-ν₀*δ).

这个想法是通过参数 ν_{0 的指数来逼近 ν₀ 之外的精确分布}。请注意，精确分布始终低于近似分布。然后我们可以随机接受相对便宜的指数变量，概率为 exp(-δ²/2).

我们可以在文献中选择一个等价的算法。在 Devroye 书中，第 IX 章第 382 页，有一些伪代码：

重复生成独立的指数随机变量 X 和 Y 直到 X² <= 2*ν₀²*Y

RETURN R <-- ν₀ + X/ν₀

可以这样编写 Numpy 演绎版：

def getMySamplesXpRj(rawRanCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma      # position of tau in standard Gaussian
    if (nu0 <= 0):
        print("Error: τ (tau) too small in getMySamplesXpRj")
    rnu0 = 1.0 / nu0

    xs  = nprd.exponential(1.0,  rawRanCount)   # exponential "raw" variates
    ys  = nprd.exponential(1.0,  rawRanCount)

    allSamples = nu0 + (rnu0 * xs)
    boolArray  = (xs*xs - 2*nu0*nu0*ys) <= 0.0
    samples    = allSamples[boolArray]

    ys  = np.exp(mu + sigma * samples)               # go back to x space
    return ys

根据 Botev-L'Ecuyer 论文中的 Table 3，该算法的拒绝率非常低。

此外，如果您愿意考虑一些复杂的问题，ENSAE-CREST 上还有一些关于尼古拉斯·肖邦 Ziggurat algorithm as used for truncated Gaussian distributions, for example the 2012 arXiv 1201.6140 paper 的文献。

旁注： 对于最新版本的 Python，似乎您可以直接使用希腊字母作为变量名，σ 代替 sigma，τ 代替tau，就像统计书中的那样：

$ python3
Python 3.9.6 (default, Jun 29 2021, 00:00:00)
>>> 
>>> σ = 2
>>> τ = 7
>>> 
>>> στ = σ * τ
>>> 
>>> στ + 1
15
>>>

Scipy：仅从尾部采样对数正态分布

Scipy: sample lognormal distribution from tail only

python

random

simulation

normal-distribution

scipy

备选方案：