Logistic regression results different in Scikit python and R?
I ran a logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).
# Python code. Imports used in this session:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
In[23]: iris_df.head(5)
Out[23]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In[35]: iris_df.shape
Out[35]: (100, 5)
#looking at the levels of the Species dependent variable..
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
#creating dependent and independent variable datasets..
x = iris_df.iloc[:, 0:4]
y = iris_df.iloc[:, -1]
#modelling starts..
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x,y)
#getting the model coefficients..
model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[36]:
0 1
0 Sepal.Length [-0.402473917528]
1 Sepal.Width [-1.46382924771]
2 Petal.Length [2.23785647964]
3 Petal.Width [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 , 0.0162694 ],
[ 0.96407227, 0.03592773],
[ 0.97647105, 0.02352895],
[ 0.95654126, 0.04345874],
[ 0.98534488, 0.01465512],
[ 0.98086592, 0.01913408],
...
# R code.
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.681e-05 -2.110e-08 0.000e+00 2.110e-08 2.006e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.556 601950.324 0 1
Sepal.Length -9.879 194223.245 0 1
Sepal.Width -7.418 92924.451 0 1
Petal.Length 19.054 144515.981 0 1
Petal.Width 25.033 216058.936 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Because of the convergence problem, I increased the maximum number of iterations and set epsilon to 0.01.
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.0102793 -0.0005659 -0.0000052 0.0001438 0.0112531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.796 704.352 0.003 0.998
Sepal.Length -3.426 215.912 -0.016 0.987
Sepal.Width -4.208 123.513 -0.034 0.973
Petal.Length 7.615 159.478 0.048 0.962
Petal.Width 11.835 285.938 0.041 0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept, and coefficients from R are all completely different from the Python ones. Which one is correct? I want to proceed with Python, but I am now confused about which result is accurate.
UPDATED
The problem is that there is perfect separation along the Petal.Width variable. In other words, this variable can perfectly predict whether a sample in the given dataset is setosa or versicolor. That breaks the maximum-likelihood estimation used by logistic regression in R: the log-likelihood can be pushed arbitrarily high by driving the coefficient on Petal.Width toward infinity.
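You can see the separation directly. The following is a minimal sketch (it assumes the iris_df DataFrame from the question is already loaded) that compares the range of Petal.Width in each class; the two ranges do not overlap.
# Non-overlapping per-class ranges of a predictor mean perfect separation.
print(iris_df.groupby('Species')['Petal.Width'].agg(['min', 'max']))
# setosa (0) spans roughly [0.1, 0.6] and versicolor (1) roughly [1.0, 1.8],
# so a threshold on Petal.Width alone classifies every sample correctly.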
Some background and strategies are discussed here.
There is also a good thread on CrossValidated discussing strategies.
So why does sklearn's LogisticRegression work? Because it employs "regularized logistic regression". Regularization penalizes large values of the estimated parameters.
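To see the connection between the two results, here is a minimal sketch (reusing x and y from the question; the particular C values are just illustrative) that weakens scikit-learn's default L2 penalty. As C grows, the coefficients grow toward the unbounded solution that R's glm was chasing.
# In scikit-learn, C is the inverse regularization strength:
# large C ~ almost no penalty ~ plain maximum likelihood.
for C in [1.0, 1e4, 1e8]:
    clf = LogisticRegression(C=C).fit(x, y)
    print(C, clf.intercept_, clf.coef_)
# Expect the coefficients to keep inflating as C increases (possibly with
# convergence warnings), mirroring the divergence glm() reported in R.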
In the example below, I use the bias-reduction method from logistf, the package for Firth's logistic regression, to produce a model that converges.
library(logistf)
irisdf <- read.table("path_to _iris.txt", sep = "\t", header = TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)
model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.
model2 <- logistf(Species ~ ., data = irisdf)
# Does converge. logistf always fits a binomial model, so it takes no family argument.
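As a side note for staying in Python: an unpenalized fit there hits the same wall as R's glm. The sketch below uses statsmodels (my own addition, not part of the original post); depending on the statsmodels version it will warn about, or raise an error for, the perfect separation.
# statsmodels fits plain (unregularized) maximum-likelihood logistic
# regression, so on perfectly separated data it behaves like R's glm.
import statsmodels.api as sm
result = sm.Logit(y, sm.add_constant(x)).fit()  # expect separation/convergence complaints
print(result.params)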
ORIGINAL
Based on the std. errors and z values in your R output, I think your model specification is incorrect. z values that close to 0 basically tell you there is no relationship between the predictors and the dependent variable, so it is a meaningless model.
My first thought was that you need to convert the Species field into a categorical variable; in your example it is of type int. Try using as.factor:
How to convert integer into categorical data in R?