How can I pass arguments through a chain of nested functions to calculate a result?

My question is a quick one, but I'm including code to better illustrate it, since reading the answers to related posts didn't clear it up for me.

The code below is meant to select the optimization parameters that make up an args list. The args list should be a single entry, like x0 in the scipy docs. I want to find the combination of args that best fits the data; the scipy optimize module should vary the values of my args to find the combination that minimizes my error. But I can't get the args passed from one function to another.

Sometimes I throw in *args, but my success rate is more miss than hit. I'd like to know how to pass args from one function to another while allowing them to change value so that their optimized values can be found (the optimized values being the ones that minimize the error, as described below). I have functions that serve as inputs to other functions, but I'm missing some key concept here. Are kwargs necessary for something like this? If the args are a tuple, can they still change value to find the optimized parameters? I know some similar questions have been asked on SO, but I haven't been able to piece together a solution from those resources yet.
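
In outline, the pattern I'm after seems to be something like this (a toy, self-contained sketch; the (x, target) pairs are made up just so it runs):

from math import exp
from scipy.optimize import basinhopping

def model( x , mu , sigma , normc ):
    ## innermost function: evaluates the fit curve at x
    return normc * exp( -(x - mu)**2 / (2 * sigma**2) )

def error( params ):
    ## objective seen by the optimizer: unpack the single parameter
    ## vector, then forward the pieces to the inner function
    mu , sigma , normc = params
    pts = [ (40.0 , 0.2) , (48.0 , 1.0) , (56.0 , 0.2) ] ## made-up (x, target) pairs
    return sum( (model( x , mu , sigma , normc ) - t)**2 for x , t in pts )

res = basinhopping( error , [ 45.0 , 5.0 , 0.5 ] , niter = 50 ) ## x0 plays the role of my args list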

The code is explained below (after the imports).

import numpy as np
import random
import matplotlib.pyplot as plt
from math import exp
from math import log
from math import pi
from scipy.integrate import quad ## integrate f(x)dx from x_i to x_i+1
from scipy.stats import norm
from scipy.stats import chisquare
from scipy.optimize import basinhopping
from scipy.stats import binned_statistic as bstat

I generated a random sample of 1000 data points from a Gaussian distribution with mean mu = 48 and standard deviation sigma = 7. I can histogram the data, and my goal is to find the parameters mu, sigma, and normc (a scaling factor, or normalization constant) that produce the best fit to the histogram of the sample data. There are many error-analysis methods, but for my purposes the best fit is the one that minimizes chi-squared (described further below). I know the code is long (maybe too long), but my question requires some setup.

## generate data sample
a, b = 48, 7 ## mu, sigma
randg = []
for index in range( 1000 ):
    randg.append( random.gauss(a,b) )
data = sorted( randg )

small = min( data )
big = max( data )
domain = np.linspace(small,big,3000) ## for fitted plot overlay on histogram of data

Then I organized the bins for the histogram.

numbins = 30 ## number of bins

def binbounder( small , big , numbins ):
    ## generates list of bound bins for histogram ++ bincount
    binwide = ( big - small ) / numbins ## binwidth
    binleft = [] ## left edges of bins
    for index in range( numbins ):
        binleft.append( small + index * binwide )
    binbound = [val for val in binleft]
    binbound.append( big ) ## all bin edges
    return binbound

binborders = binbounder( small , big , numbins )
## useful if one performs plt.hist(data, bins = binborders, ...)
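
Side note: since numpy is already imported, the same edges can be produced in one line; this is just an equivalent alternative to the longhand function above.

binborders = np.linspace( small , big , numbins + 1 ).tolist() ## same 31 bin edges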

def binmidder( small , big , numbins ):
    ## all midpoints of bins
    ## for x-ticks on histogram
    ## useful to visualize over/under -estimate of error
    binwide = ( big - small ) / numbins
    binmiddles = []
    for index in range( numbins ):
        binmiddles.append( small + binwide/2 + index * binwide )
    return binmiddles

binmids = binmidder( small , big , numbins )

To perform a chi-squared analysis, one must input the expectation value per bin (E_i) and the observed multiplicity per bin (O_i), and output the sum over all bins of their squared differences from each bin's expectation value, scaled by the expectation value: chisq = Σ (O_i - E_i)² / E_i.
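
As a quick sanity check that scipy's chisquare computes exactly this sum (toy numbers, not my data):

from scipy.stats import chisquare

O = [ 9, 11, 10 ]          ## toy observed counts
E = [ 10.0, 10.0, 10.0 ]   ## toy expected counts
byhand = sum( (o - e)**2 / e for o, e in zip( O, E ) )
print( byhand, chisquare( O, E )[0] )  ## both print 0.2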

def countsperbin( xdata ):
    ## calculates multiplicity of observed values per bin
    values = sorted( xdata ) ## function(xdata) ~ f(x)
    bincount = []
    for jndex in range( len( binborders ) - 1 ):
        summ = 0
        for val in values:
            if val > binborders[ jndex ] and val <= binborders[ jndex + 1 ]:
                summ += 1
        bincount.append( summ )
    return bincount

obsperbin = countsperbin( data ) ## multiplicity of observed values per bin
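
For what it's worth, numpy can produce the same counts in one call (note: np.histogram uses half-open bins [left, right) except for the last bin, so values landing exactly on an edge may be assigned differently than with my (left, right] convention):

obsperbin_np , _ = np.histogram( data , bins = binborders ) ## counts per bin, as an array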

Each expectation value per bin, needed to calculate and minimize chi-squared, is defined as the integral of the distribution function from x_i (the left bin edge) to x_i+1 (the right bin edge).
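
Equivalently, for a Gaussian that integral is a difference of CDFs, which the scipy.stats.norm imported above can supply (a sketch with concrete numbers, not my actual fit parameters):

## probability mass of one bin, [45.0, 50.0], under N(48, 7), as a CDF difference
p_bin = norm.cdf( 50.0 , loc = 48 , scale = 7 ) - norm.cdf( 45.0 , loc = 48 , scale = 7 )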

I want a reasonable initial guess for my optimization parameters, since a reasonable guess at the parameters gives a reasonable starting value for the chi-squared to be minimized. I chose mu, sigma, and normc close to but not equal to their true values so that I could test whether the minimization works.

def maxbin( perbin ):
    ## perbin is a list of observed data per bin
    ## returns largest multiplicity of observed values with index
    ## useful to help guess scaling factor "normc" (outside exponential in GaussDistrib)
    for index, maxval in enumerate( perbin ):
        if maxval == max( perbin ):
            optindex = index
    return optindex, perbin[ optindex ] 

mu, sigma, normc = np.mean( data ) + 30, np.std( data ) + 20, maxbin( obsperbin )[1] ## [1] = largest bin count, a rough guess for the scale factor

Since we integrate f(x)dx, the data points themselves (the xdata) are irrelevant here.

def GaussDistrib( xdata , mu , sigma , normc ): ## G(x)
    ## explicit parameters instead of a mutable default list, so quad can forward them via args
    return normc * exp( (-1) * (xdata - mu)**2 / (2 * sigma**2) )

def expectperbin( args ):
    ## args[0] = mu , args[1] = sigma , args[2] = normc
    ## calculates expectation values per bin
    ## needed with observation values per bin for ChiSquared
    ## expectation value of single bin is equal to area under Gaussian curve from left binedge to right binedge
    ## area under curve for ith bin = integral G(x)dx from x_i (left edge) to x_i+1 (right edge)
    ans = []
    for index in range(len(binborders)-1): # ith index does not exist for rightmost boundary
        ans.append( quad( GaussDistrib , binborders[ index ] , binborders[ index + 1 ], args = ( args[0] , args[1] , args[2] ))[0])
    return ans

My function chisq calls chisquare from the scipy module to return the result.

def chisq( args ):
    ## args[0] = mu
    ## args[1] = sigma
    ## args[2] = normc
    ## last subscript [0] gives chi-squared value, [1] gives 0 ≤ p-value ≤ 1
    ## can also minimize negative p-value to find best fitting chi square
    return chisquare( obsperbin , expectperbin( args ))[0]

I don't yet know how, but I would like to impose constraints on my system. Specifically, the maximum of the list of binned-data heights must be greater than zero (and the chi-squared must be greater than zero, since an exponential term survives differentiation).
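
From the basinhopping docs, one apparent route is to forward bounds to a bounded local minimizer through minimizer_kwargs (a sketch I haven't verified, assuming the L-BFGS-B method; the bounds keep sigma and normc positive):

## sketch: bounds are (min, max) pairs per parameter; None means unbounded
minimizer_kwargs = { "method" : "L-BFGS-B" ,
                     "bounds" : [ (None, None) , (1e-6, None) , (1e-6, None) ] }
## result = basinhopping( chisq , [ mu , sigma , normc ] , minimizer_kwargs = minimizer_kwargs , niter = 200 )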

def miniz( chisq , paramguess , niter = 200 ):
    ## paramguess is the initial parameter vector (x0), not a chi-squared value
    minimizer = basinhopping( chisq , paramguess , niter = niter )
    ## Minimization methods available via https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html
    return minimizer

expperbin = expectperbin( args = [mu , sigma , normc] )
# chisqmin = chisquare( obsperbin , expperbin )[0]
# chisqmin = result.fun

""" OPTIMIZATION """

print("")
print("initial guess of optimal parameters")

initial_mu, initial_sigma, initial_normc = np.mean(data)+30 , np.std(data)+20 , maxbin( obsperbin )[1]
## check optimized result against:  mu = 48, sigma = 7 (via random number generator for Gaussian Distribution)

chisqguess = chisquare( obsperbin , expectperbin( [ initial_mu , initial_sigma , initial_normc ] ))[0]
## initial guess for optimization

result = miniz( chisq , [ initial_mu , initial_sigma , initial_normc ] )
print(result)
print("")

The purpose of the minimization is to find the optimized parameters that best fit the data.

optmu , optsigma , optnormc = result.x[0], abs(result.x[1]), result.x[2]

chisqcheck = chisquare(obsperbin, expperbin)
chisqmin = result.fun
print("chisqmin --  ",chisqmin,"        ",chisqcheck," --   check chi sq")

print("""
""")

## CHECK
checkbins = bstat(data, data, statistic = 'sum', bins = binborders) ## via SCIPY (imports)
binsum = checkbins[0]
binedge = checkbins[1]
binborderindex = checkbins[2]
print("binsum",binsum)
print("")
print("binedge",binedge)
print("")
print("binborderindex",binborderindex)
# Am I doing this part right?

tl;dr: I want result, which calls the function miniz, which calls the scipy module to minimize chi-squared using the guess values; chi-squared and the guess values each call other functions in turn, and so on. How do I pass my parameters through all of this the right way?

You can access all the information in the OptimizeResult returned from optimize.basinhopping.
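
For instance, with res being the result of the basinhopping call further down:

print( res.x )    ## optimized parameter vector [mu, sigma]
print( res.fun )  ## objective (chi-squared) value at the optimum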

I've abstracted away the generation of the random sample and reduced your functions to the 5 that are really needed to run the optimization.

The only "tricky" part of the parameter passing is getting the parameters mu and sigma through to the GaussDistrib function inside the quad call, but that is explained right in the quad doc. Other than that, I see no real problem with passing parameters here.
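
For instance, quad forwards everything in its args tuple as extra positional arguments to the integrand (using the two-parameter GaussDistrib defined below):

## quad calls GaussDistrib(x, 48.0, 7.0) internally for each x
area = quad( GaussDistrib , 41.0 , 55.0 , args = (48.0 , 7.0) )[0] ## ~0.68, i.e. +-1 sigma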

Your lengthy detour through normc was misguided. You won't get the correct values from the Gaussian that way (there is no need to vary 3 independent parameters when 2 suffice). Moreover, to get a correct value for chi-squared you must multiply the probabilities from the Gaussian distribution by the sample count (you were comparing the absolute counts in obsperbin against probabilities under the Gaussian, which is obviously wrong).

from math import exp
from math import pi
from scipy.integrate import quad
from scipy.stats import chisquare
from scipy.optimize import basinhopping


# smallest value in the sample
small = 26.55312337811099
# largest value in the sample
big   = 69.02965763016027

# a random sample from N(48, 7) with 999 sample
# values binned into 30 equidistant bins ranging
# from 'small' (bin[0] lower bound) to 'big'
# (bin[29] upper bound) 
obsperbin = [ 1,  1,  2,  4,  8, 10, 13, 29, 35, 45,
             51, 56, 63, 64, 96, 89, 68, 80, 61, 51,
             49, 30, 34, 19, 22,  3,  7,  5,  1,  2]

numbins = len(obsperbin) #  30
numobs  = sum(obsperbin) # 999

# intentionally wrong guesses of mu and sigma
# to be provided as optimizer's initial values
initial_mu, initial_sigma = 78.5, 27.0


def binbounder( small , big , numbins ):
    ## generates list of bound bins for histogram ++ bincount
    binwide = ( big - small ) / numbins ## binwidth
    binleft = [] ## left edges of bins
    for index in range( numbins ):
        binleft.append( small + index * binwide )
    binbound = [val for val in binleft]
    binbound.append( big ) ## all bin edges
    return binbound

# setup the bin borders
binborders = binbounder( small , big , numbins )


def GaussDistrib( x , mu , sigma ):
    return 1/(sigma * (2*pi)**(1/2)) * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )


def expectperbin( musigma ):
    ## musigma[0] = mu
    ## musigma[1] = sigma
    ## calculates expectation values per bin
    ## expectation value of single bin is equal to area under Gaussian
    ## from left binedge to right binedge multiplied by the sample size
    e = []
    for i in range(len(binborders)-1): # ith index does not exist for rightmost boundary
        e.append( quad( GaussDistrib , binborders[ i ] , binborders[ i + 1 ],
                         args = ( musigma[0] , musigma[1] ))[0] * numobs)
    return e


def chisq( musigma ):
    ## first subscript [0] gives chi-squared value, [1] gives 0 ≤ p-value ≤ 1
    return chisquare( obsperbin , expectperbin( musigma ))[0]


def miniz( chisq , musigma ):
    return basinhopping( chisq , musigma , niter = 200 )


## chisquare value for initial parameter guess
chisqguess = chisquare( obsperbin , expectperbin( [initial_mu , initial_sigma] ))[0]

res = miniz( chisq, [initial_mu , initial_sigma] )

print("chisquare from initial guess:" , chisqguess)
print("chisquare after optimization:" , res.fun)
print("mu, sigma after optimization:" , res.x[0], ",", res.x[1])

chisquare from initial guess: 3772.70822797

chisquare after optimization: 26.351284911784447

mu, sigma after optimization: 48.2701027439, 7.046156286

By the way, basinhopping is overkill for this kind of problem. I'd stay with fmin (Nelder-Mead).
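
A minimal sketch of that alternative, reusing the chisq and initial guesses from above:

from scipy.optimize import fmin

## Nelder-Mead simplex search; plenty for a smooth 2-parameter chi-squared surface
xopt = fmin( chisq , [ initial_mu , initial_sigma ] )
print( xopt )  ## optimized [mu, sigma]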