Different integration results using Monte Carlo vs scipy.integrate.nquad

Question

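The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data with the stats.gaussian_kde() function.

The integration is performed over all (x, y) below the threshold point (x1, y1), which defines the upper integration limits (the lower limits are -infinity; see the MWE).

The int1 function uses a simple Monte Carlo approach.
The int2 function uses the scipy.integrate.nquad function.

The issue is that int1 (i.e. the Monte Carlo method) systematically gives larger values for the integral than int2. I don't know why this happens.

Here is an example of the integral values obtained after 200 runs of int1 (blue histogram) versus the integral result given by int2 (red vertical line):

What is the origin of this difference in the resulting integral value?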

MWE

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate


def int1(kernel, x1, y1):
    # Compute the point below which to integrate
    iso = kernel((x1, y1))

    # Sample KDE distribution
    sample = kernel.resample(size=50000)

    # Filter the sample
    insample = kernel(sample) < iso

    # The integral is equivalent to the probability of drawing a
    # point that gets through the filter
    integral = insample.sum() / float(insample.shape[0])

    return integral


def int2(kernel, x1, y1):

    def f_kde(x, y):
        return kernel((x, y))

    # 2D integration in: (-inf, x1), (-inf, y1).
    integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])

    return integral


# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)

# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5

i2 = int2(kernel, x1, y1)
print(i2)

int1_vals = []
for _ in range(200):
    i = int1(kernel, x1, y1)
    int1_vals.append(i)
    print(i)

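Addendum

Note that this question originated from this answer. At first I didn't notice that the answer was mistaken in the integration limits it used, which explains why the results of int1 and int2 differ.

int1 integrates over the domain f(x,y) < f(x1,y1) (where f is the kernel density estimate), while int2 integrates over the domain (x,y) < (x1,y1).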
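
To make the two domains concrete, here is a minimal sketch that applies both tests to the same resample. It assumes synthetic standard-normal data in place of data.dat (which is not available here) and a hypothetical threshold point (0, 0); the seed and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Assumption: synthetic 2D standard-normal data stands in for data.dat.
data = np.random.normal(size=(2, 1000))
kernel = stats.gaussian_kde(data)

# Hypothetical threshold point, chosen near the mode of the density.
x1, y1 = 0.0, 0.0

# Draw points from the KDE itself; sample has shape (2, n_points).
sample = kernel.resample(size=20000)

# Domain tested by the question's int1: density below the density at (x1, y1).
below_iso = kernel(sample) < kernel((x1, y1))

# Domain integrated by int2: x < x1 and y < y1.
in_box = (sample[0] < x1) & (sample[1] < y1)

frac_iso = below_iso.mean()
frac_box = in_box.mean()
print(frac_iso, frac_box)
```

Under these assumptions the density test accepts far more points than the box test, because almost every point has a lower density than a point near the mode.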

Answer 1

You resample the distribution

sample = kernel.resample(size=50000)

and then test, for each sampled point, whether its density is less than the density at the boundary point

insample = kernel(sample) < iso

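This is incorrect. Consider the boundary (0,100) and assume your data has mean u=(0,0) and cov=[[100,0],[0,100]]. The points (0,50) and (50,0) have the same probability density under this kernel, but only one of them is within the boundary. Since the density test treats both the same way, points outside the boundary can be counted, and you oversample.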
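
The counterexample can be checked numerically. The sketch below uses scipy.stats.multivariate_normal as a stand-in for the kernel described above, and it treats the boundary as inclusive (an assumption) because the point (0, 50) lies exactly on the edge x = 0.

```python
import numpy as np
from scipy import stats

# The distribution from the example: mean (0, 0), covariance 100 * identity.
dist = stats.multivariate_normal(mean=[0.0, 0.0],
                                 cov=[[100.0, 0.0], [0.0, 100.0]])

bound = np.array([0.0, 100.0])                  # boundary point (x1, y1)
points = np.array([[0.0, 50.0], [50.0, 0.0]])   # one point per row

# Both points lie at the same distance from the mean, so the densities match.
densities = dist.pdf(points)
same_density = np.isclose(densities[0], densities[1])

# Box test (inclusive, by assumption): x <= x1 and y <= y1 for each point.
inside_box = (points <= bound).all(axis=1)

print(same_density, inside_box)
```

A test based only on the density value gives both points the same verdict, while the box test separates them.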

Instead, you should test whether each point in sample lies within the boundary and then compute the probability. Something like

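def int1(kernel, x1, y1):
    # Sample the KDE distribution; sample has shape (2, n_points)
    sample = kernel.resample(size=100)

    # Keep only the points with x < x1 and y < y1 (the bound broadcasts per row)
    include = (sample < np.array([[x1], [y1]])).all(axis=0)
    # The fraction of sampled points inside the box estimates the integral
    integral = include.sum() / float(sample.shape[1])
    return integral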

I tested this using the following code

def measure(n):

    m1 = np.random.normal(size=n)
    m2 = np.random.normal(size=n)
    return m1,m2

a = stats.gaussian_kde(np.vstack(measure(1000)))
print(int1(a,-10,-10))
print(int2(a,-10,-10))
print(int1(a,0,0))
print(int2(a,-0,-0))

which yields

0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
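
Two details help when reading this output. nquad returns a tuple of (value, estimated absolute error), so the first element is the integral. Also, with only 100 resampled points the hit-count estimate is noisy: a fraction p estimated from N independent draws has a binomial standard error of sqrt(p(1-p)/N). A small sketch of that arithmetic follows; the helper name mc_standard_error is hypothetical.

```python
import math


def mc_standard_error(p, n):
    """Binomial standard error of a hit fraction p estimated from n draws."""
    return math.sqrt(p * (1.0 - p) / n)


# Fraction taken from the output above: int1 gave 0.26 with size=100.
se_100 = mc_standard_error(0.26, 100)
# The same fraction with the 50000 points used in the question's int1.
se_50000 = mc_standard_error(0.26, 50000)
print(se_100, se_50000)
```

With N = 100 the standard error is roughly 0.04, so agreement to two decimal places in a single run is partly luck; a larger sample size gives a tighter comparison.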

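Monte Carlo integration should work like this

Sample N random values (uniformly, not from your distribution) over some subset of the possible values of x/y (below I bound it at 10 standard deviations from the mean).
For each random value, compute kernel(rand_x, rand_y).
Compute the sum and multiply by (volume)/N_samples.

In code: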

def mc_wo_sample(kernel,x1,y1,lboundx,lboundy):
    nsamples = 50000
    volume = (x1-lboundx)*(y1-lboundy)
    # generate uniform points in range                                                                                                     
    xrand = np.random.rand(nsamples,1)*(x1-lboundx) + lboundx
    yrand = np.random.rand(nsamples,1)*(y1-lboundy) + lboundy
    randvals = np.hstack((xrand,yrand)).transpose()
    print(randvals.shape)
    return (volume*kernel(randvals).sum())/nsamples

Running the following

   print(int1(a,-9,-9))
   print(int2(a,-9,-9))
   print(mc_wo_sample(a,-9,-9,-10,-10))
   print(int1(a,0,0))
   print(int2(a,-0,-0))
   print(mc_wo_sample(a,0,0,-10,-10))

yields

0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674
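
As a further cross-check, gaussian_kde also provides an integrate_box method that integrates the estimate over a rectangle. The sketch below compares it with a hit-count estimate on synthetic data. The seed, the synthetic data, and the finite lower bounds of -10 (mirroring the choice above) are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Assumption: synthetic standard-normal data, similar to the test above.
a = stats.gaussian_kde(np.random.normal(size=(2, 1000)))

x1, y1 = 0.0, 0.0
lower = np.array([-10.0, -10.0])   # finite stand-in for -infinity
upper = np.array([x1, y1])

# Integral of the KDE over the rectangle between lower and upper.
box_integral = a.integrate_box(lower, upper)

# Hit-count Monte Carlo: fraction of resampled points inside the box.
sample = a.resample(size=50000)
hit_fraction = ((sample[0] < x1) & (sample[1] < y1)).mean()

print(box_integral, hit_fraction)
```

If the two numbers agree to within the Monte Carlo noise, the hit-count version of int1 is integrating over the same domain as the box integral.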
