Precision calculation warning when using GridSearchCV for Logistic Regression
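I am trying to run GridSearchCV with a LogisticRegression estimator and record the model's accuracy, precision, recall, and F1 metrics.
However, I get the following warning on the precision metric: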
Precision is ill-defined and being set to 0.0 due to no predicted samples.
Use `zero_division` parameter to control this behavior
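I understand why I get the message: in some of the k-fold splits, none of the predictions has an output value equal to 1. What I don't understand is how to set `zero_division` to 1 specifically within GridSearchCV (the logistic_reg variable).
Original code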
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

logistic_reg = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000),
    param_grid={
        "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 5, 10, 20],
        "solver": ["liblinear", "saga"],
    },
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=StratifiedKFold(n_splits=10), refit="accuracy")
logistic_reg_X_train = self.X_train.copy()
logistic_reg_X_train.drop(self.columns_removed, axis=1, inplace=True)
logistic_reg.fit(logistic_reg_X_train, self.y_train)
logistic_reg_results = pd.DataFrame(logistic_reg.cv_results_)
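I tried changing "precision" to precision_score(zero_division=1), but that gave me a different error (missing 2 required positional arguments: 'y_true' and 'y_pred'). Again, I understand why: the two missing arguments are not defined until the fit method is applied.
How do I specify the zero_division=1 parameter for the precision score metric?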
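Edit
What I don't understand is this: I stratified the y data in the train_test_split method and used StratifiedKFold in GridSearchCV. My understanding is that the train/test data will have the same proportion of y values, and that the same should happen during cross-validation. This means that within the GridSearchCV samples the data should contain y values of both 0 and 1, so precision cannot be equal to 0 (the model will be able to calculate TP and FP because the sample test data contains samples with y equal to 1). I'm not sure where to go from here.
From further reading on this issue, my understanding is that the warning occurs when not all of the labels in my y_test appear in my y_pred. That is not the case for my data.
I used G.Anderson's comment to remove the warning (but it doesn't answer my question):
Created a new custom scorer object
Created a custom_scoring dictionary
Updated the GridSearchCV scoring and refit parameters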
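To make the trigger of the warning concrete, here is a minimal sketch on made-up labels (the arrays y_true and y_pred are illustrative, not the data from the question). It suggests that the warning depends on the predictions rather than on which labels are present in the test fold: y_true contains both classes, yet precision is undefined because nothing is predicted as 1.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

# Illustrative labels: both classes are present in the "test fold"
y_true = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical classifier output that never predicts the positive class
y_pred = np.zeros_like(y_true)

# Default behaviour: TP + FP == 0, so precision is 0 / 0 and a warning is raised
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)

warned = any(issubclass(w.category, UndefinedMetricWarning) for w in caught)

# zero_division picks the value returned for the undefined case and silences the warning
precision_as_zero = precision_score(y_true, y_pred, zero_division=0)
precision_as_one = precision_score(y_true, y_pred, zero_division=1)

print(warned, default_precision, precision_as_zero, precision_as_one)
```

Stratification keeps both classes in every fold of y, but it places no constraint on what the fitted model predicts.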
from sklearn.metrics import precision_score, make_scorer

# Wrap precision_score so that zero_division is passed when GridSearchCV calls it
precision_scorer = make_scorer(precision_score, zero_division=0)
custom_scoring = {"accuracy": "accuracy", "precision": precision_scorer, "recall": "recall", "f1": "f1"}
logistic_reg = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000),
    param_grid={
        "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20],
        "solver": ["liblinear", "saga"],
    },
    scoring=custom_scoring, cv=StratifiedKFold(n_splits=10), refit="accuracy")
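With a scoring dictionary like the one above, the per-split scores should already be available in cv_results_ under keys of the form split<k>_test_<name>, where <name> is the dictionary key. A small sketch on synthetic data follows; make_classification, the reduced grid, and the 5 splits are assumptions chosen to keep the example quick, not the setup from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

custom_scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": "recall",
    "f1": "f1",
}
n_splits = 5
search = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", solver="liblinear", random_state=42),
    param_grid={"C": [0.1, 1, 10]},
    scoring=custom_scoring,
    cv=StratifiedKFold(n_splits=n_splits),
    refit="accuracy",
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Per-split precision of the candidate selected by refit="accuracy"
split_cols = [f"split{k}_test_precision" for k in range(n_splits)]
best_row_precision = results.loc[search.best_index_, split_cols]
print(best_row_precision)
```

The same pattern applies to the other dictionary keys, for example split0_test_recall or mean_test_f1.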
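Edit - answer to the question above
I used GridSearchCV to find the best hyperparameters for the model. To see the model metrics for each split, I created a StratifiedKFold splitter, fit an estimator with the best hyperparameters, and ran the cross-validation myself. This did not give me any precision warning messages. I don't know why GridSearchCV gives me the warning, but at least this approach works!
Note: the results I get from the method below are the same as those from GridSearchCV in the question above.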
from sklearn import metrics

# logistic_reg_class_X_train is the prepared training frame (its definition is not shown in the post)
skf = StratifiedKFold(n_splits=10)
logistic_reg_class_skf = LogisticRegression(penalty="l1", max_iter=10000, random_state=42, C=5, solver="liblinear")
logistic_reg_class_score = []
for train, test in skf.split(logistic_reg_class_X_train, self.y_train):
    logistic_reg_class_skf_X_train = logistic_reg_class_X_train.iloc[train]
    logistic_reg_class_skf_X_test = logistic_reg_class_X_train.iloc[test]
    logistic_reg_class_skf_y_train = self.y_train.iloc[train]
    logistic_reg_class_skf_y_test = self.y_train.iloc[test]
    # Fit on the training part of the split and score on the held-out part
    logistic_reg_class_skf.fit(logistic_reg_class_skf_X_train, logistic_reg_class_skf_y_train)
    logistic_reg_skf_y_pred = logistic_reg_class_skf.predict(logistic_reg_class_skf_X_test)
    skf_accuracy_score = metrics.accuracy_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_precision_score = metrics.precision_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_recall_score = metrics.recall_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_f1_score = metrics.f1_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    logistic_reg_class_score.append([skf_accuracy_score, skf_precision_score, skf_recall_score, skf_f1_score])
classification_results = pd.DataFrame({"Algorithm": ["Logistic Reg Train"], "Accuracy": [0.0], "Precision": [0.0],
                                       "Recall": [0.0], "F1 Score": [0.0]})
# One row per split; recall and F1 are read from the same split i (indices 2 and 3)
for i in range(0, 10):
    classification_results.loc[i] = ["Logistic Reg Train", logistic_reg_class_score[i][0], logistic_reg_class_score[i][1],
                                     logistic_reg_class_score[i][2], logistic_reg_class_score[i][3]]
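One possible explanation for the difference, offered as an assumption since the post does not confirm it: GridSearchCV scores every candidate in the grid, not only the best one, and with an L1 penalty the smallest C values may shrink every coefficient to zero so that the model predicts a single class. The manual loop only ever fits C=5. The sketch below uses synthetic data and a two-value grid (both hypothetical) to list the candidates whose precision is zero on every split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", solver="liblinear", random_state=42, max_iter=10000),
    param_grid={"C": [1e-4, 1]},  # one heavily regularized candidate, one moderate
    scoring={"precision": make_scorer(precision_score, zero_division=0)},
    cv=StratifiedKFold(n_splits=5),
    refit=False,
)
grid.fit(X, y)

# Candidates whose mean precision is exactly 0, i.e. zero on every split
zero_precision_C = [
    params["C"]
    for params, mean_p in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_precision"])
    if mean_p == 0.0
]
print(zero_precision_C)

# Check directly that the heavily regularized model predicts a single class
tiny_c_model = LogisticRegression(penalty="l1", solver="liblinear", C=1e-4, random_state=42).fit(X, y)
predicted_classes = set(tiny_c_model.predict(X).tolist())
print(predicted_classes)
```

If the same pattern holds on the real data, filtering cv_results_ this way would show which C and solver combinations produce the warning while leaving the best candidate unaffected.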