Why are my CatBoost fit metrics different from the sklearn evaluation metrics?
I'm still not sure whether this question belongs on this forum or on Cross Validated, but I'll try here since it's more about the code's output than about the technique itself. Here's the thing: I'm running a CatBoost classifier like this:
# import libraries
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# import data
train = pd.read_csv("train.csv")

# get features and label
X = train[["Pclass", "Sex", "SibSp", "Parch", "Fare"]]
y = train[["Survived"]]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# model parameters
model_cb = CatBoostClassifier(
    cat_features=["Pclass", "Sex"],
    loss_function="Logloss",
    eval_metric="AUC",
    learning_rate=0.1,
    iterations=500,
    od_type="Iter",
    od_wait=200,
)

# fit model
model_cb.fit(
    X_train,
    y_train,
    plot=True,
    eval_set=(X_test, y_test),
    verbose=50,
)

y_pred = model_cb.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(roc_auc_score(y_test, y_pred))
The dataframe I'm using comes from the Titanic competition (link).

The problem is that the model_cb.fit step reports an AUC of 0.87, but the last line, sklearn's roc_auc_score, gives an AUC of 0.73, which is much lower. As I understand it, CatBoost's AUC should already be computed on the test dataset, since I passed it as eval_set.

Any ideas what the problem is here and how I can fix it?
The ROC curve requires predicted probabilities, or some other kind of confidence measure, not hard class predictions. CatBoost's predict() returns hard 0/1 labels, so sklearn computes the AUC over those labels rather than over the model's scores. Use

y_pred = model_cb.predict_proba(X_test)[:, 1]

See Scikit-learn : roc_auc_score and Why does roc_curve return only 3 values?.
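To see why hard labels deflate the score, here is a minimal sketch with made-up data (the arrays are purely illustrative, not from the Titanic set). AUC measures how well the scores rank positives above negatives; thresholding at 0.5, which is effectively what predict() does, collapses the ranking into ties and losses:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.45, 0.7, 0.9])

# AUC on the probabilities uses the full ranking of the scores.
auc_prob = roc_auc_score(y_true, y_prob)

# Hard labels from thresholding at 0.5 (what .predict() returns):
# the fine-grained ranking is lost, so the AUC drops.
y_hard = (y_prob >= 0.5).astype(int)
auc_hard = roc_auc_score(y_true, y_hard)

print(auc_prob)  # ~0.889
print(auc_hard)  # ~0.667
```

The same effect explains the 0.87 vs 0.73 gap in the question: CatBoost evaluates its AUC eval_metric on the raw model scores, while roc_auc_score was fed thresholded labels.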