Using Pipeline in a cross_validate() function for testing different ML algorithms
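I have a dataset with 17 features (X) and a binary classification target (y). I have already prepared the dataset and applied train_test_split() to it. I am using the following script to run and compare different ML algorithms on the dataset: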
# Imports are not shown in the original post; they are listed here so the script is complete.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # assumed: SMOTE is a sampler, so imblearn's Pipeline is needed
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

def run_exps(X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:
    # Lightweight script to test many models and find winners
    # :param X_train: training split
    # :param y_train: training target vector
    # :param X_test: test split
    # :param y_test: test target vector
    # :return: DataFrame of predictions
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN - Euclidean', KNeighborsClassifier(metric='euclidean')),
        ('SVM', SVC()),
        ('XGB', XGBClassifier(use_label_encoder=False, eval_metric='error'))
    ]
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    # Loop over the models: train, cross-validate, predict and evaluate each one
    for name, model in models:
        # Pipeline that normalizes and oversamples the dataset (not used yet, see the question below)
        pipe = Pipeline([
            ('normalization', MinMaxScaler()),
            ('oversampling', SMOTE())
        ])
        kfold = StratifiedKFold(n_splits=5)
        # How can I call the pipeline inside the cross_validate() function?
        cv_results = cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print('''
        {}
        {}
        {}
        '''.format(name, classification_report(y_test, y_pred), confusion_matrix(y_test, y_pred)))
        names.append(name)
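I realized that the data needs to be normalized and oversampled before the models are run on it. However, since the script uses cross_validate(), the normalization and oversampling have to be performed within each fold rather than once on the whole training set.

To do that, I created a pipeline (which normalizes and oversamples the dataset) inside the for loop that takes each model and performs training, cross-validation, prediction and evaluation. But I am not sure how to call the pipeline, because the estimator parameter of cross_validate() already takes the model variable and makes its predictions from it.

What should I do in this situation?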
You can add your model as the last step of the pipeline and then call cross_validate on the pipeline, as follows:
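# Pipeline has to be imblearn.pipeline.Pipeline here:
# sklearn's Pipeline rejects samplers such as SMOTE as intermediate steps.
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('name', model)
])
cv_results = cross_validate(pipe, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)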