Applying Pipeline and Nested Cross-Validation Partly Manually with RFECV and GridSearch
I am trying to implement nested cross-validation more manually than by running `GridSearchCV` inside `cross_val_score`, so that I have more control over what I am doing, can account for data leakage, can see and log how my variables and hyperparameters change, and get a clearer picture of how everything works.

I would like to ask whether you think the implementation below is adequate.
Processing pipelines (guarding against data leakage)

```python
import time
from collections import Counter

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Categorical variables
cat_pipe = Pipeline([("encoder", OneHotEncoder(drop="first"))])
# Numerical features that can be treated as categorical
num_pipe = Pipeline([("encoder", OneHotEncoder(drop="first"))])
# Continuous variables
num_cont_pipe = Pipeline([("scaler", StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[("cat", cat_pipe, cat_var),
                  ("num", num_pipe, num_cat_vars),
                  ("num_cont", num_cont_pipe, num_cont)],
    n_jobs=-1)  # , sparse_threshold=0
```
Nested cross-validation

```python
# shuffle=True is required when a random_state is set
cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # or RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=0)
```
Estimator and model

```python
estimator = RandomForestClassifier(class_weight="balanced")
model = XGBClassifier(eval_metric="logloss")
```
Feature selection and hyperparameter optimization

```python
rfe = RFECV(estimator, cv=cv_inner, scoring="roc_auc", n_jobs=-1)
```
Implementation

```python
outer_results = []
outer_params = []
for train_ix, test_ix in cv_outer.split(X, y):
    time_start_outer = time.time()
    X_train = X.iloc[train_ix, :]
    X_test = X.iloc[test_ix, :]
    y_train = y.iloc[train_ix]
    y_test = y.iloc[test_ix]
    # Data prep: fit the preprocessor on the training fold only
    X_train_enc = preprocessor.fit_transform(X_train)
    X_test_enc = preprocessor.transform(X_test)
    # Feature selection
    X_train_enc_var = rfe.fit(X_train_enc, y_train)
    print(f"The number of features selected is {X_train_enc_var.n_features_}")
    X_train_enc_var = rfe.transform(X_train_enc)
    X_test_enc_var = rfe.transform(X_test_enc)
    # Hyperparameter tuning
    counter = Counter(y_train)
    weight = counter[0] / counter[1]
    hyper_params = {
        # XGBClassifier
        "scale_pos_weight": [1, 2, 3, 4, 5, 6, weight, 7, 8, 9, 10, 11, 12]
    }
    grid_search = GridSearchCV(model, cv=cv_inner, param_grid=hyper_params,
                               n_jobs=-1, scoring="roc_auc")
    X_train_enc_var_hyp = grid_search.fit(X_train_enc_var, y_train)
    best_model = X_train_enc_var_hyp.best_estimator_
    yhat = best_model.predict(X_test_enc_var)
    # Evaluate the model
    score = roc_auc_score(y_test, yhat)
    # Store the result
    outer_results.append(score)
    outer_params.append(X_train_enc_var_hyp.best_params_)
    # Report progress
    print(f"test_score= {score}, validation_score= {X_train_enc_var_hyp.best_score_}")
# Summarize the estimated performance of the model
print(f"best_score: {np.mean(outer_results)}, {np.std(outer_results)}")
```
I noticed three things.

First, not important:
```python
# Feature Selection
X_train_enc_var = rfe.fit(X_train_enc, y_train)
print(f"The number of features selected is {X_train_enc_var.n_features_}")
X_train_enc_var = rfe.transform(X_train_enc)
X_test_enc_var = rfe.transform(X_test_enc)
```
You save the fitted `rfe` to `X_train_enc_var`, then two lines later overwrite it with the transformed dataset. It all works the way you want, but perhaps be more honest with the variable names:
```python
X_train_enc_var = rfe.fit_transform(X_train_enc, y_train)
X_test_enc_var = rfe.transform(X_test_enc)
print(f"The number of features selected is {rfe.n_features_}")
```
Second, you use the same `cv_inner` for both the recursive feature elimination and the grid search. That means the number of selected features carries information from the (inner) test folds, so the grid-search scores may be optimistically biased. That may not be a big deal, since you only care about the relative scores within the search; but *maybe* some hyperparameter combinations would do better with a different number of features and so get penalized. One way to decouple the two steps is sketched below.
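If the extra compute is acceptable, here is a minimal sketch of that decoupling, assuming you nest `RFECV` and the classifier in one `Pipeline` so that feature selection is refit on every inner training fold; the step names and the reduced grid are only illustrative:

```python
# Sketch: nest feature selection inside the grid search so RFECV is refit
# on each inner training fold instead of sharing folds with the search.
inner_pipe = Pipeline([
    ("rfe", RFECV(estimator, cv=5, scoring="roc_auc")),  # its own internal CV
    ("xgb", XGBClassifier(eval_metric="logloss")),
])
grid_search = GridSearchCV(
    inner_pipe,
    param_grid={"xgb__scale_pos_weight": [1, 3, 6, 9, 12]},  # illustrative grid
    cv=cv_inner,
    scoring="roc_auc",
    n_jobs=-1,
)
```

This also lets each hyperparameter combination settle on its own number of features, at the cost of rerunning RFECV for every grid point and fold.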
Finally, the most serious but also the easiest to fix: you call `best_model.predict(...)`, but since you are scoring with `roc_auc_score`, you need `best_model.predict_proba(...)[:, 1]` to get probability scores (for the positive class).
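Concretely, the evaluation step becomes:

```python
# ROC AUC needs a ranking score, not hard class labels:
# take the predicted probability of the positive class
yhat = best_model.predict_proba(X_test_enc_var)[:, 1]
score = roc_auc_score(y_test, yhat)
```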
One more suggestion: since you are doing so much computation, save more of the information, e.g. the `grid_scores_` of the rfe and the `cv_results_` of the grid searches.
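For instance, something like the sketch below, collecting per-fold diagnostics into lists initialized before the outer loop; the list names are placeholders, and note that in recent scikit-learn versions `RFECV.grid_scores_` has been replaced by `RFECV.cv_results_`:

```python
# Placeholder lists, created once before the outer CV loop
outer_rfe_scores = []
outer_search_results = []

# ...then inside the outer loop, after fitting rfe and grid_search:
outer_rfe_scores.append(rfe.grid_scores_)             # CV score at each feature count
outer_search_results.append(grid_search.cv_results_)  # full grid-search results table
```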