Logistic regression results different in Scikit python and R?
I ran a logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).
# Python code. Imports used in this session:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
In[23]: iris_df.head(5)
Out[23]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In[35]: iris_df.shape
Out[35]: (100, 5)
#looking at the levels of the Species dependent variable..
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
#creating dependent and independent variable datasets..
x = iris_df.iloc[:, 0:4]
y = iris_df.iloc[:, -1]
#modelling starts..
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x,y)
#getting the model coefficients..
model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[36]:
0 1
0 Sepal.Length [-0.402473917528]
1 Sepal.Width [-1.46382924771]
2 Petal.Length [2.23785647964]
3 Petal.Width [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 , 0.0162694 ],
[ 0.96407227, 0.03592773],
[ 0.97647105, 0.02352895],
[ 0.95654126, 0.04345874],
[ 0.98534488, 0.01465512],
[ 0.98086592, 0.01913408],
...
# R code.
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.681e-05 -2.110e-08 0.000e+00 2.110e-08 2.006e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.556 601950.324 0 1
Sepal.Length -9.879 194223.245 0 1
Sepal.Width -7.418 92924.451 0 1
Petal.Length 19.054 144515.981 0 1
Petal.Width 25.033 216058.936 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Because of the convergence problem, I increased the maximum number of iterations and set epsilon to 0.01.
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.0102793 -0.0005659 -0.0000052 0.0001438 0.0112531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.796 704.352 0.003 0.998
Sepal.Length -3.426 215.912 -0.016 0.987
Sepal.Width -4.208 123.513 -0.034 0.973
Petal.Length 7.615 159.478 0.048 0.962
Petal.Width 11.835 285.938 0.041 0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept, and coefficients from R are all completely different from the Python ones. Which one is correct? I want to proceed with Python, but I am now confused about which result is accurate.
UPDATED
The problem is that there is perfect separation along the Petal.Width variable. In other words, this variable can perfectly predict whether a sample in the given dataset is setosa or versicolor. That breaks the maximum-likelihood estimation used by logistic regression in R: the log-likelihood can be pushed arbitrarily high by driving the coefficient on Petal.Width toward infinity.
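You can see the separation directly. The following is a minimal sketch (it assumes the iris_df DataFrame from the question is already loaded) that compares the range of Petal.Width in each class; the two ranges do not overlap.
# Non-overlapping per-class ranges of a predictor mean perfect separation.
print(iris_df.groupby('Species')['Petal.Width'].agg(['min', 'max']))
# setosa (0) spans roughly [0.1, 0.6] and versicolor (1) roughly [1.0, 1.8],
# so a threshold on Petal.Width alone classifies every sample correctly.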
Some background and strategies are discussed here.
There is also a good thread on CrossValidated discussing strategies.
So why does sklearn's LogisticRegression work? Because it employs "regularized logistic regression". Regularization penalizes large values of the estimated parameters.
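To see the connection between the two results, here is a minimal sketch (reusing x and y from the question; the particular C values are just illustrative) that weakens scikit-learn's default L2 penalty. As C grows, the coefficients grow toward the unbounded solution that R's glm was chasing.
# In scikit-learn, C is the inverse regularization strength:
# large C ~ almost no penalty ~ plain maximum likelihood.
for C in [1.0, 1e4, 1e8]:
    clf = LogisticRegression(C=C).fit(x, y)
    print(C, clf.intercept_, clf.coef_)
# Expect the coefficients to keep inflating as C increases (possibly with
# convergence warnings), mirroring the divergence glm() reported in R.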
In the example below, I use the bias-reduction method from logistf, the package for Firth's logistic regression, to produce a model that converges.
library(logistf)
irisdf <- read.table("path_to _iris.txt", sep = "\t", header = TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)
model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.
model2 <- logistf(Species ~ ., data = irisdf)
# Does converge. logistf always fits a binomial model, so it takes no family argument.
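As a side note for staying in Python: an unpenalized fit there hits the same wall as R's glm. The sketch below uses statsmodels (my own addition, not part of the original post); depending on the statsmodels version it will warn about, or raise an error for, the perfect separation.
# statsmodels fits plain (unregularized) maximum-likelihood logistic
# regression, so on perfectly separated data it behaves like R's glm.
import statsmodels.api as sm
result = sm.Logit(y, sm.add_constant(x)).fit()  # expect separation/convergence complaints
print(result.params)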
ORIGINAL
Based on the std. errors and z values in your R output, I think your model specification is incorrect. z values that close to 0 basically tell you there is no relationship between the predictors and the dependent variable, so it is a meaningless model.
My first thought was that you need to convert the Species field into a categorical variable; in your example it is of type int. Try using as.factor:
How to convert integer into categorical data in R?