Multiple problems with Logistic Regression (1. all CV values have the same score, 2. classification report and accuracy don't match)
I ran logistic regression on bank loan data.
I used GridSearchCV for hyperparameter tuning and ran the search with several fold counts, cv = [3, 5, 6].
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io
import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()
df = pd.read_csv('CleanedLoanData13Cols.csv')
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
parameters = {'penalty': ['l1', 'l2', 'elasticnet'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'solver': ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
              'multi_class': ['auto'],
              'max_iter': [5, 15, 25]
              }
import warnings
warnings.filterwarnings("ignore")
cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)
for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')
Output (issue 1: this looks wrong to me; please correct me if I am mistaken):
The best score with CV = 3 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
The best score with CV = 5 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
The best score with CV = 6 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
Continuing:
results = logmodel.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
Output:
[0.9084348 nan nan 0.8323203 nan 0.83239873
0.83671225 0.8323203 0.8323203 0.8323203 nan nan
nan nan nan 0.91647373 nan nan
0.8323203 nan 0.902435 0.89474906 0.8520445 0.8323203 and so on
Continuing:
print(results.get('mean_train_score'))
Output: None
print(logmodel.best_params_)
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
print(logmodel.best_score_)
Output: 0.9226303384209481 (I think something is wrong here too, because this does not match the accuracy in the classification report)
final_model = logmodel.best_estimator_
s_predictions = final_model.predict(s_scaled_X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))
Output: the accuracy here is 0.62, whereas the score above was 0.92
              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929

[[5902 3295]
 [ 812  920]]
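I don't know what I did wrong. I have spent the past few hours trying to solve this, but I cannot work out where the mistake is. I would really appreciate any input.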
The problem here is that you are fitting the model on the unscaled data, X_train and y_train:
logmodel.fit(X_train, y_train)
You then predict on the scaled data, s_scaled_X_test, which explains the drop in performance:
s_predictions = final_model.predict(s_scaled_X_test)
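The effect can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: it uses synthetic data (not the loan CSV), and the feature values and variable names are made up for the example. It fits on unscaled features and then compares predictions on unscaled versus scaled test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (an assumption, not the real CSV):
# one informative feature centred far from zero and one noise feature.
rng = np.random.default_rng(0)
x1 = rng.normal(loc=100.0, scale=20.0, size=2000)
x2 = rng.normal(loc=5.0, scale=1.0, size=2000)
X = np.column_stack([x1, x2])
y = (x1 > 100.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# The scaler is fitted on the training split only, as in the question.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# The model is fitted on the UNSCALED training features, as in the question.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Same preprocessing at fit and predict time.
acc_matched = accuracy_score(y_test, model.predict(X_test))
# Different preprocessing at predict time: the learned weights no longer fit the inputs.
acc_mismatched = accuracy_score(y_test, model.predict(X_test_scaled))

print("unscaled fit, unscaled test:", acc_matched)
print("unscaled fit, scaled test:  ", acc_mismatched)
```

The coefficients and intercept are learned for the original feature ranges, so feeding standardized values into the same model shifts every prediction relative to the decision boundary.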
To fix this, train the model on the scaled data (and evaluate it on s_scaled_X_test as well):
logmodel.fit(s_scaled_X_train, y_train)
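One way to make this mismatch harder to reintroduce is to put the scaler and the classifier into a single Pipeline and hand that to GridSearchCV, so the same preprocessing is applied at fit and predict time. The following is a minimal sketch, assuming synthetic data in place of the loan CSV; the step names "scaler" and "clf" and the reduced grid over C are illustrative choices, not the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=12, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training
# fold and applied automatically whenever the pipeline predicts.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    pipe, param_grid, cv=5, scoring="accuracy", return_train_score=True
)
search.fit(X_train, y_train)  # raw features go in; the pipeline scales them

print(search.best_params_)
print("mean CV accuracy:", search.best_score_)
print("test accuracy:   ", search.score(X_test, y_test))
print(search.cv_results_["mean_train_score"])
```

Two details relate to the outputs in the question. `best_score_` is the mean cross-validated accuracy over the training folds, while `score(X_test, y_test)` is the accuracy on the held-out test set, so the two numbers need not be identical. `mean_train_score` is only present in `cv_results_` when `return_train_score=True` is passed. The sketch searches only over `C` because not every penalty and solver combination is supported (for example, `lbfgs` does not accept `l1`), and unsupported combinations are a likely source of `nan` scores in a grid like the original.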