Multiple problems with Logistic Regression (1. all CV values have the same score, 2. classification report and accuracy don't match)

I implemented logistic regression on bank loan data. I used GridSearchCV for hyperparameter tuning and ran logistic regression with several k-folds = [3, 5, 6]. Here is my code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io

import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()

df = pd.read_csv('CleanedLoanData13Cols.csv')

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
parameters = {'penalty': ['l1', 'l2','elasticnet'],
                  'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                  'solver' : ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
                  'multi_class' : ['auto'],
                  'max_iter'    : [5,15,25]
                 }


cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)

for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')

Output: (First issue: I don't think this is right! Please correct me if I'm wrong?)

The best score with CV = 3 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 

The best score with CV = 5 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 


The best score with CV = 6 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 

Continuing:

results = logmodel.cv_results_

print(results.get('params'))

print(results.get('mean_test_score'))

Output:

[0.9084348         nan        nan 0.8323203         nan 0.83239873
 0.83671225 0.8323203  0.8323203  0.8323203         nan        nan
        nan        nan        nan 0.91647373        nan        nan
 0.8323203         nan 0.902435   0.89474906 0.8520445  0.8323203 and so on

Continuing:

print(results.get('mean_train_score'))

Output: None

print(logmodel.best_params_)

{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

print(logmodel.best_score_)

Output: 0.9226303384209481 (I think something is wrong here too, because this doesn't match the accuracy in the classification report)

final_model = logmodel.best_estimator_

s_predictions = final_model.predict(s_scaled_X_test)

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))

Output: the accuracy here is 0.62, whereas above it was 0.92

              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929

[[5902 3295]
 [ 812  920]]

I don't know where I went wrong. I've been trying to solve this for the past few hours, but I can't figure out my mistake. I'd really appreciate any input on this.

The problem here is that you are fitting the model on the unscaled data X_train, y_train:

logmodel.fit(X_train, y_train)

but then you try to predict on the scaled data s_scaled_X_test, which explains the drop in performance:

s_predictions = final_model.predict(s_scaled_X_test)

To fix this, you should train your model on the scaled data, as follows:

logmodel.fit(s_scaled_X_train, y_train)
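A cleaner way to avoid this class of mistake entirely is to wrap the scaler and the classifier in a single sklearn Pipeline, so that GridSearchCV re-fits the scaler inside every CV split and every `fit`/`predict`/`score` call transforms the data consistently. A minimal sketch (using synthetic data from `make_classification` as a stand-in, since the loan CSV is not available here, and a reduced parameter grid for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=500, n_features=12, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

# Scaling happens inside the pipeline, so it is applied
# automatically and consistently to train, CV, and test data
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(random_state=42, max_iter=1000))])

# Parameters of pipeline steps are addressed as <step>__<param>
params = {'clf__C': [0.01, 0.1, 1, 10],
          'clf__penalty': ['l2']}

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)  # raw data in; the pipeline scales it itself

print(grid.best_params_)
print(grid.score(X_test, y_test))  # same pipeline scales X_test before scoring
```

With this setup there is no separate `s_scaled_X_train`/`s_scaled_X_test` to keep in sync, and the scaler is fit only on each fold's training portion, which also avoids leaking test-fold statistics into the scaling step.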