Python package for getting the maximum likelihood estimator for logistic regression

Since this post is quite long, I'll put my questions right here:

1. Is there a package in Python that will give me the maximum likelihood estimator parameters, for a given number of parameters p, for covariates x and data values y? (Preferably with comprehensive documentation on how to implement it.)

2. Is the approach scalable, i.e. will it work if I try O(100) covariates and a similar number of data samples?

Background:

I am trying to use Python to investigate things like the distribution of the maximum likelihood estimator for different numbers of samples n / covariates p. My script generates data for logistic regression just fine, but I have not been able to get any method of parameter estimation (i.e. the parameter values maximising the log likelihood) to work correctly.

Approaches I have tried:

- Coding up my own version of Newton-Raphson. However, the errors in my estimates diverge over repeated iterations (and of course I checked for the obvious sign and inequality mistakes!)

- Using a Newton conjugate-gradient implementation. This also failed, with the error 'the truth value of an array is ambiguous'. When I wrote my own version I could work around this with all(), but not when using the package. Strange, since I thought Newton-CG was supposed to handle the multivariate case! (A simplified sketch of the kind of call I mean appears below, after this list.)

- Finally, I intended to use the package statsmodels. I don't know if I'm just being really dense, but I can't find any comprehensive documentation on this? The best I've found for obtaining the logistic regression parameters is this, from which I tried, for example,

X,y = logit_data(np.power(10,6),p,theta)

y=np.reshape(y, (len(y),))

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)

thetaEst = clf.get_params(X, y)

and I have also tried, for the last line:

thetaEst = clf.get_params()

but nothing seems to give me an estimate of the parameter values. I either get errors or objects I wasn't expecting. Surely there is a Python package that can do this?!? Surely I shouldn't have to resort to R (I don't know R D:!!!)
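For reference, this is roughly the shape of the Newton-CG call I mean. It is a simplified, illustrative sketch only (the negloglike/grad_negloglike helper names are made up for this sketch, not my actual code), and it assumes X, y come from the logit_data helper defined in the optional code further down, with y flattened to 1-D:

#Illustrative sketch of a scipy Newton-CG fit of the logistic regression MLE (not my exact code)
import numpy as np
from scipy.optimize import minimize

def negloglike(theta, X, y):
    #Scalar negative log-likelihood for logistic regression
    h = 1/(1+np.exp(-X @ theta))
    return -float(y @ np.log(h) + (1-y) @ np.log(1-h))

def grad_negloglike(theta, X, y):
    #Gradient of the negative log-likelihood
    h = 1/(1+np.exp(-X @ theta))
    return X.T @ (h - y)

#logit_data is defined in the optional code below
X, y = logit_data(1000, 5, np.asarray([1,1.2,1.1,0.8,0.9,1.3]))
y = y.ravel()   #flatten to 1-D so the objective returns a scalar

theta0 = np.zeros(X.shape[1])
res = minimize(negloglike, theta0, args=(X, y), jac=grad_negloglike, method='Newton-CG')
print(res.x)    #estimated parameters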

Optional code

I didn't want to make this post any longer, but I'm sure someone will ask for it. So here is the code implementing Newton-Raphson, and it is this function that I have been replacing with the existing packages:

#Script for producing y data from p covariates drawn from a specified distribution, specified theta parameters,
#and n data samples for logit link function.

import numpy as np
import matplotlib.pyplot as plt


#Define link function here
def g(z):
    g=1/(1+np.exp(-z))
    return g

#For producing y data values given true parameters theta and number of covariates
def logit_data(n,p, theta):
    #Define parameters

    #1)Number of covariates
    p_i = p+1  #with intercept
    p_i = int(p_i)

    #2)n as correct data type
    n = int(n)

    #3)Specify parameter values to be estimated
    theta=np.reshape(theta, (p_i,1))

    #4)Define distribution from which covariate values are drawn i.i.d., and initialise data values
    X=np.zeros((n,p_i))
    X[:,0]=1    #intercept
    mean=0
    sigma=1.5

    X[:,1:]=np.random.normal(mean,sigma,(n,p))


    #5)Produce y values treating y as a Bernoulli variable with p=g(X*theta)
    r=np.random.uniform(0,1,n)
    r=np.reshape(r, (len(r),1))
    htrue=g(X.dot(theta))
    y=htrue-r
    y[y>=0]=1
    y[y<0]=0

    return X, y





#Newton Raphson implementation
def NewtonRaphson(X,y):
    ##NOTE: The functions negloglikelihood, gradf, and hessian all return values for f = -(log likelihood),
    #so that the NR method minimises f (rather than maximising l)

    #Define log likelihood function to be maximised
    def negloglikelihood(y,h):
        l= y.transpose() @ np.log(h) + (1-y).transpose() @ np.log(1-h)
        f=-l
        return f

    #Define gradient of log likelihood function
    def gradf(y, h, X):
        a=(y-h).transpose()
        gradl= np.matmul(a,X)
        grad_f=-gradl
        return grad_f

    #Define second derivative (Hessian) of log likelihood function
    def hessian(h, X):
        D=np.identity(len(h))*(np.matmul(h,(1-h).transpose()))
        H=-X.transpose() @ D @ X
        Hf=-H
        return Hf


    #Minimise f=-l

    #Produce initial theta estimate and probability parameter h
    np.random.seed(555)
    thetaEst=np.random.normal(1.1, 0.4, 6)

    eta=np.matmul(X,thetaEst)
    h=g(eta)

    #While not at a minimum of f

    #control constants
    a=10e-8
    b=10e-8
    i=0
    j=0
    k=0

    while not (np.linalg.norm(gradf(y,h,X)) < np.absolute(negloglikelihood(y,h)) * a + b):
        i=i+1
        #print('i = %s' %i)
        h=g(np.matmul(X,thetaEst)) 
        H=hessian(h,X)      #Cholesky check below to ensure the Hessian (of f) is positive semi-definite
       # print(H)
        try:
            np.linalg.cholesky(H)
            #print('j = %s' %j)
            j=j+1
        except np.linalg.LinAlgError:
            print('Hessian not positive semi-definite!')
            try:
                v,w=np.linalg.eig(H)
               # print(v,w)
                v=np.absolute(v)
                H=w @ np.diag(v) @ np.linalg.inv(w)
            except:
                return thetaEst


        delta = 0

        try:
            delta=np.linalg.solve(H, np.reshape(gradf(y,h,X),(6,1))) #Solve for incremental theta step
            #print('k = %s' %k)
            k=k+1
        except:
            return thetaEst   #Simply return theta estimate if have singular hessian

        while negloglikelihood(y, h) > negloglikelihood(y, g(np.matmul(X,thetaEst+delta))):
            print('!!')
            delta=0.5*delta               #Ensure added step is sufficently small so as not to diverge

        thetaEst=thetaEst+delta

    return thetaEst









#Main control

#1)Sample numbers to test for errors in beta, as powers of 10.
npowers=np.arange(1,2,0.05)
n=np.power(10,npowers)


#2)Number of independent covariates
p=5


#3)True theta to be estimated (parameter values)
theta=np.asarray([1,1.2,1.1,0.8,0.9,1.3])


#4)Initiate arrays to store estimates of theta (and errors) computed at specified sample numbers N
Thetas=np.zeros((len(npowers),p+1))
Errors=np.zeros((len(npowers),p+1))


#5)Obtain random covariate values from specified distribution, and corresponding y values using true theta
#plus gaussian noise term.
X,y = logit_data(n[-1],p,theta)


#6)Calculate cumulative means for given n values, for the theta estimates
for ind,N in enumerate(n):
    N=np.int(N)
    thetaTemp=NewtonRaphson(X[0:N,:],y[0:N])
    Thetas[ind,:] = np.reshape(thetaTemp,6)



#7)Calculate true errors
#print(Thetas)
Errors=Thetas-theta.transpose()
absError=np.abs(Errors)
nerror=Errors*np.sqrt(n)[:,np.newaxis]
logerror = np.log10(absError)

#8)Save data as csv


#9)Plots

plt.scatter(X@theta, g(X@theta))
plt.scatter(X@theta,y)
plt.show()



fig=plt.figure()
for i in range(p+1):
    plt.plot(npowers, logerror[:,i])

fig.suptitle('log10(Absolute Error) in MLE against log10(Number of samples,N) for logistic regression')
plt.xlabel('log_10(N)')
plt.ylabel('log_10(Absolute Error)')

fig.savefig('logiterrors7.png')
plt.show()

Documentation on the logistic regression model in statsmodels can be found here, for the latest development version. All models follow a familiar series of steps, so this should provide sufficient information to implement it in practice (do make sure to have a look at some examples, e.g. here). In general, I wouldn't recommend re-implementing solvers/models that are already available in scipy or statsmodels, unless you have a very specific need.

Now, I've used your script to generate some data and used the Logit model to estimate the parameters, as follows,

import statsmodels.api as sm

X, y = logit_data(n[-1], p, theta)

model = sm.Logit(y, X)
result = model.fit()

print(result.summary())

This outputs (your mileage may vary),

Optimization terminated successfully.
         Current function value: 0.203609
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   89
Model:                          Logit   Df Residuals:                       83
Method:                           MLE   Df Model:                            5
Date:                Wed, 17 Jul 2019   Pseudo R-squ.:                  0.7062
Time:                        13:40:37   Log-Likelihood:                -18.121
converged:                       True   LL-Null:                       -61.684
                                        LLR p-value:                 2.695e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0735      0.540      1.986      0.047       0.014       2.133
x1             2.0890      0.594      3.518      0.000       0.925       3.253
x2             1.7191      0.459      3.746      0.000       0.820       2.618
x3             1.7228      0.464      3.713      0.000       0.813       2.632
x4             1.1897      0.410      2.902      0.004       0.386       1.993
x5             2.2008      0.653      3.370      0.001       0.921       3.481
==============================================================================

Possibly complete quasi-separation: A fraction 0.10 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

The coefficients themselves can be accessed via result.params, as follows,

>>>result.params
[1.0734945  2.08898192 1.71907914 1.72278748 1.18972079 2.20079805]
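To tie this back to your distribution study, here is a minimal sketch (assuming the n, X, y, and Thetas arrays from your script) of dropping the Logit fit into your loop in place of NewtonRaphson:

#Sketch: using sm.Logit instead of NewtonRaphson in the cumulative loop (illustrative)
import statsmodels.api as sm

for ind, N in enumerate(n):
    N = int(N)
    result = sm.Logit(y[0:N].ravel(), X[0:N, :]).fit(disp=0)   #disp=0 silences the per-fit output
    Thetas[ind, :] = result.params

Note that for the very small N at the start of your range (around 10 samples for 6 parameters) the fit may fail to converge or hit perfect separation, so you may want to wrap the fit in a try/except, just as you already do in your own solver.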