Generalized logistic regression in Python: how to correctly define a binary independent variable?
I am using a weighted generalized linear model (statsmodels) for classification:
import statsmodels.api as sm
model = sm.GLM(y, x_with_intercept, max_iter=500, random_state=42, family=sm.families.Binomial(), freq_weights=weights)
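One of the variables in x_with_intercept is binary. I assumed that including a binary variable would not be a problem, yet the model output shows extremely high standard errors compared with the other variables.
Is there a way to correctly specify a binary variable in the model definition?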
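Looking at your coefficients, even the intercept has a very high standard error, which is odd. Most likely your binary variable has too few positive cases and is confounded with your other variables, so the model has trouble estimating it.
For example, with an ordinary, balanced binary predictor there is no problem: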
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(123)
# Two balanced binary predictors and two continuous predictors
x = pd.DataFrame({'b1': np.random.binomial(1, 0.5, 100),
                  'b2': np.random.binomial(1, 0.5, 100),
                  'c1': np.random.uniform(0, 1, 100),
                  'c2': np.random.normal(0, 1, 100)})
y = np.random.binomial(1, 0.5, 100)
wt = np.random.poisson(10, 100)  # frequency weights
# max_iter and random_state are not GLM() arguments, so they are dropped here;
# the iteration cap is the maxiter argument of fit()
model = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial(),
               freq_weights=wt)
results = model.fit(maxiter=500)
The result looks normal:
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3858      0.149     -2.592      0.010      -0.677      -0.094
b1             0.1965      0.132      1.490      0.136      -0.062       0.455
b2             0.6114      0.140      4.362      0.000       0.337       0.886
c1            -0.5621      0.214     -2.621      0.009      -0.982      -0.142
c2            -0.5108      0.072     -7.113      0.000      -0.651      -0.370
==============================================================================
Now we make one of the binary variables have only 2 positive values, and tie it to c1:
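# b1 is now 1 only where c1 > 0.99, so it is rare and determined by c1
x['b1'] = (x['c1'] > 0.99).astype(int)
x.b1.sum()
2
# As before, the iteration cap goes to fit() as maxiter, not to GLM()
model = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial(),
               freq_weights=wt)
results = model.fit(maxiter=500)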
Now the standard error for b1 is huge:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                     1039
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -675.80
Date:                Sat, 24 Apr 2021   Deviance:                       1351.6
Time:                        11:39:50   Pearson chi2:                 1.02e+03
No. Iterations:                    22
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2151      0.128     -1.685      0.092      -0.465       0.035
b1            23.6170   1.73e+04      0.001      0.999    -3.4e+04     3.4e+04
b2             0.6002      0.138      4.358      0.000       0.330       0.870
c1            -0.7671      0.220     -3.482      0.000      -1.199      -0.335
c2            -0.4366      0.072     -6.037      0.000      -0.578      -0.295
==============================================================================
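A coefficient around 23 with a standard error in the tens of thousands is the usual signature of (quasi-)complete separation. The output above does not show it, but assuming the two rows with b1 == 1 both have y == 1, the sketch below illustrates what happens: the log-likelihood of those rows keeps rising as the b1 coefficient grows, while the Fisher information they contribute shrinks toward zero, so the standard error (at least 1/sqrt(information)) blows up. The weights, the intercept value, and the function names are made up for illustration, and the other predictors are ignored.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def separated_group_loglik(beta, intercept, weights):
    """Weighted log-likelihood of the rows with b1 == 1, all of which have y == 1."""
    p = sigmoid(intercept + beta)
    return float(np.sum(weights) * np.log(p))

def separated_group_information(beta, intercept, weights):
    """Fisher information for beta; only rows with b1 == 1 contribute: sum(w) * p * (1 - p)."""
    p = sigmoid(intercept + beta)
    return float(np.sum(weights) * p * (1.0 - p))

weights = np.array([10.0, 12.0])  # hypothetical frequency weights of the two b1 == 1 rows
intercept = -0.2                  # hypothetical intercept, other predictors ignored
betas = [0.0, 5.0, 10.0, 20.0]

logliks = [separated_group_loglik(b, intercept, weights) for b in betas]
infos = [separated_group_information(b, intercept, weights) for b in betas]

for b, ll, info in zip(betas, logliks, infos):
    # 1 / sqrt(information) is a lower bound on the standard error of beta
    print(f"beta={b:5.1f}  loglik={ll:.3e}  information={info:.3e}  "
          f"SE lower bound={1.0 / np.sqrt(info):.3e}")
```

The optimizer therefore stops at some large, essentially arbitrary value of the coefficient, and the reported standard error reflects an almost flat likelihood rather than a usable estimate.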
I would check whether it is worth including that variable at all.
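One way to run that check before fitting is to tabulate the binary predictor against the response, using the frequency weights. This is only a minimal sketch: weighted_crosstab and has_empty_cell are hypothetical helper names, and the toy arrays are made up rather than taken from the example above. An empty cell means one value of the predictor never occurs together with one of the outcomes, in which case the logit coefficient has no finite maximum-likelihood estimate.

```python
import numpy as np

def weighted_crosstab(binary_x, y, weights):
    """Return a 2x2 array t where t[i, j] is the total weight of rows with binary_x == i and y == j."""
    binary_x = np.asarray(binary_x, dtype=int)
    y = np.asarray(y, dtype=int)
    weights = np.asarray(weights, dtype=float)
    table = np.zeros((2, 2))
    # np.add.at accumulates correctly even when the same (x, y) cell is hit repeatedly
    np.add.at(table, (binary_x, y), weights)
    return table

def has_empty_cell(table):
    """True if some combination of predictor value and outcome never occurs."""
    return bool((table == 0).any())

# Hypothetical toy data: the predictor is 1 for only two rows, and both have y == 1
b = np.array([0, 0, 0, 0, 1, 1])
y = np.array([0, 1, 0, 1, 1, 1])
w = np.array([3, 2, 4, 1, 5, 2])

table = weighted_crosstab(b, y, w)
print(table)
print("empty cell:", has_empty_cell(table))
```

If the table has an empty or nearly empty cell, dropping the variable or merging it with a related predictor is usually more informative than keeping a coefficient the data cannot pin down.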