Generalized logistic regression in Python: how to correctly define a binary independent variable?

Question

I am using a weighted generalized linear model (statsmodels) for classification:

import statsmodels.api as sm
model= sm.GLM(y, x_with_intercept, max_iter=500, random_state=42, family=sm.families.Binomial(),freq_weights=weights)

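One of the variables in x_with_intercept is binary. I assumed that including a binary variable would not be a problem, but in the model output it has an extremely high standard error compared with the other variables.

Is there a way to properly specify a binary variable in the model definition?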

Answer 1

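Looking at your coefficients, even the intercept has a high standard error, which is odd. Most likely your binary variable has too few positive cases and is confounded with your other variables, so the model has trouble estimating its coefficient.

For example, with a well-behaved binary predictor there is no problem: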
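One quick way to check for this is to cross-tabulate the binary predictor against the outcome and look for empty or nearly empty cells. The sketch below uses made-up toy data, and the helper name `separation_report` is hypothetical, not part of statsmodels or pandas. It ignores the frequency weights, which is enough for spotting a cell that has no observations at all.

```python
import numpy as np
import pandas as pd

def separation_report(binary, y):
    """Cross-tabulate a 0/1 predictor against a 0/1 outcome and flag empty cells."""
    table = pd.crosstab(pd.Series(binary, name="predictor"),
                        pd.Series(y, name="outcome"))
    # Make sure all four cells are present, even if a level never occurs
    table = table.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    has_empty_cell = bool((table.values == 0).any())
    return table, has_empty_cell

# Toy data (invented for illustration): the predictor is 1 only twice,
# and both of those rows have outcome 1.
binary = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_toy = np.array([0, 1, 0, 1, 0, 1, 1, 1])

table, has_empty_cell = separation_report(binary, y_toy)
print(table)
print("empty cell:", has_empty_cell)
```

An empty cell means the outcome is perfectly predicted for one level of the predictor, which typically shows up as a very large coefficient with an enormous standard error.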

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(123)

x = pd.DataFrame({'b1': np.random.binomial(1, 0.5, 100),
                  'b2': np.random.binomial(1, 0.5, 100),
                  'c1': np.random.uniform(0, 1, 100),
                  'c2': np.random.normal(0, 1, 100)})

y = np.random.binomial(1,0.5,100)
wt = np.random.poisson(10,100)

# GLM() has no max_iter or random_state arguments; the iteration cap is fit(maxiter=...)
model = sm.GLM(y, sm.add_constant(x),
               family=sm.families.Binomial(),
               freq_weights=wt)

results = model.fit(maxiter=500)
print(results.summary())

The results look normal:

==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3858      0.149     -2.592      0.010      -0.677      -0.094
b1             0.1965      0.132      1.490      0.136      -0.062       0.455
b2             0.6114      0.140      4.362      0.000       0.337       0.886
c1            -0.5621      0.214     -2.621      0.009      -0.982      -0.142
c2            -0.5108      0.072     -7.113      0.000      -0.651      -0.370
==============================================================================

Now make one of the binary variables have only two positive values (b1 is now also fully determined by c1):

x['b1'] = (x['c1']>0.99).astype(int)

x.b1.sum()
2

# Refit with the sparse b1; maxiter is passed to fit(), not to GLM()
model = sm.GLM(y, sm.add_constant(x),
               family=sm.families.Binomial(),
               freq_weights=wt)

results = model.fit(maxiter=500)
print(results.summary())

I get a huge standard error for b1:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                     1039
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -675.80
Date:                Sat, 24 Apr 2021   Deviance:                       1351.6
Time:                        11:39:50   Pearson chi2:                 1.02e+03
No. Iterations:                    22                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2151      0.128     -1.685      0.092      -0.465       0.035
b1            23.6170   1.73e+04      0.001      0.999    -3.4e+04     3.4e+04
b2             0.6002      0.138      4.358      0.000       0.330       0.870
c1            -0.7671      0.220     -3.482      0.000      -1.199      -0.335
c2            -0.4366      0.072     -6.037      0.000      -0.578      -0.295
==============================================================================

I would check whether this variable is worth including at all.
