python logistic regression (beginner)
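I am trying to teach myself some logistic regression using python. I am trying to apply the lessons from the walkthrough here to the small dataset in the wikipedia entry here.
Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give an intercept of -4.0777 and a coefficient of 1.5046, but the code I built from the github example outputs -0.924200 and 0.756024 respectively.
The code I am trying to use is below. Is anything obviously wrong?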
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                 df, return_type="dataframe")
y = np.ravel(y)
model = LogisticRegression()
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))
>>>
0
Intercept -0.924200
study_hrs 0.756024
Solution
Just change the model creation line to
model = LogisticRegression(C=100000, fit_intercept=False)
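Problem analysis
By default, sklearn fits a regularized LogisticRegression with C=1, where C is the inverse of the regularization strength
(small C means strong regularization, large C means weak regularization).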
This class implements regularized logistic regression using the
liblinear library, newton-cg and lbfgs solvers. It can handle both
dense and sparse input. Use C-ordered arrays or CSR matrices
containing 64-bit floats for optimal performance; any other input
format will be converted (and copied).
So to obtain their model you should fit
model = LogisticRegression(C=1000000)
which gives
Intercept -2.038853 # this is actually half the intercept
study_hrs 1.504643 # this is correct
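To see the effect of C in isolation, here is a minimal sketch in plain NumPy. It is not sklearn's solver: it runs Newton's method on an L2-penalized log-loss in which the constant term is penalized as well, which is assumed here to mirror the liblinear-style formulation. The names `hours`, `passed`, `design` and `fit_penalized_logit` are illustrative only, and the design matrix has a single column of ones.

```python
import numpy as np

hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                  2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)


def fit_penalized_logit(design, target, C, n_iter=50):
    """Minimize sum(log-loss) + ||w||^2 / (2*C) with Newton's method.

    Every weight is penalized, including the one on the constant column
    (an assumption made for this sketch).
    """
    w = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ w))           # predicted probabilities
        grad = design.T @ (p - target) + w / C           # gradient of the objective
        curvature = (p * (1.0 - p))[:, None]             # per-sample Hessian weights
        hess = design.T @ (design * curvature) + np.eye(len(w)) / C
        w = w - np.linalg.solve(hess, grad)              # Newton step
    return w


# One constant column plus the feature.
design = np.column_stack([np.ones_like(hours), hours])

for C in (1.0, 100.0, 1e6):
    intercept, slope = fit_penalized_logit(design, passed, C)
    print(f"C={C:g}: intercept={intercept:.4f}, study_hrs={slope:.4f}")
```

As C grows the penalty term vanishes, and the fit should approach the unregularized values reported by Wikipedia (intercept about -4.0777, coefficient about 1.5046).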
Furthermore, the problem also lies in the way you process your data with patsy; see this simplified, correct example
import numpy as np
from sklearn.linear_model import LogisticRegression
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
X = np.array([[x] for x in X])
y = np.ravel(y)
model = LogisticRegression(C=1000000.)
model = model.fit(X,y)
print('coef', model.coef_)
print('intercept', model.intercept_)
which gives
coef [[ 1.50464059]]
intercept [-4.07769916]
So what exactly is the problem? When you call dmatrices, by default it adds a column of ones (the bias) to your input data
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                 df, return_type="dataframe")
print(X)
which results in
Intercept study_hrs
0 1 0.50
1 1 0.75
2 1 1.00
3 1 1.25
4 1 1.50
5 1 1.75
6 1 1.75
7 1 2.00
8 1 2.25
9 1 2.50
10 1 2.75
11 1 3.00
12 1 3.25
13 1 3.50
14 1 4.00
15 1 4.25
16 1 4.50
17 1 4.75
18 1 5.00
19 1 5.50
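This is why the resulting bias is only half of the true one: scikit-learn also adds a column of ones, so you now have two biases, and the optimal solution is to give each of them half of the weight.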
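A short worked argument for the halving, under the assumption that both constant terms are L2-penalized (as with a liblinear-style fit, where the added constant column is regularized too). The fit depends only on the sum b = b_1 + b_2 of the two biases, so for any given b the penalty picks the split with the smallest squared norm:

```latex
\min_{b_1,\, b_2}\ \tfrac{1}{2}\left(b_1^2 + b_2^2\right)
\quad \text{subject to} \quad b_1 + b_2 = b
\;\Longrightarrow\; b_1 = b_2 = \frac{b}{2},
\qquad \frac{-4.0777}{2} \approx -2.0389 .
```

This agrees with the printed Intercept value of -2.038853, the other half being held in the model's own intercept term.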
So what can you do?
- do not use patsy this way
- forbid patsy from adding a bias
- tell sklearn not to add a bias

The third option looks like this:
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                 df, return_type="dataframe")
y = np.ravel(y)
model = LogisticRegression(C=100000, fit_intercept=False)
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))
which gives
Intercept -4.077571
study_hrs 1.504597
as desired.
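The second option in the list (stopping patsy from adding its own bias column) could be sketched as follows. The `- 1` term in the formula is patsy's syntax for removing the intercept. Pinning `solver="liblinear"` is an assumption made here to match the solver described in the quoted documentation, since newer scikit-learn releases use a different default solver. The variable names `hours` and `passed` are illustrative.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
         2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
df = pd.DataFrame({'study_hrs': hours, 'p_or_f': passed})

# "- 1" drops patsy's Intercept column, so sklearn adds the only constant term.
y, X = dmatrices('p_or_f ~ study_hrs - 1', df, return_type="dataframe")
y = np.ravel(y)

# Large C makes the regularization negligible; liblinear is pinned as an assumption.
model = LogisticRegression(C=1e6, solver="liblinear").fit(X, y)

print(list(X.columns))
print('coef', model.coef_)
print('intercept', model.intercept_)
```

With only one constant term in play, the coefficient and intercept should land close to the values from the simplified NumPy-array example earlier in this answer.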