How to implement SMOTE in cross validation and GridSearchCV
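I'm relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply the over- and undersampling to the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced left-out fold. The problem is that when I do this, I cannot use the familiar sklearn interface for evaluation and grid search.
Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this: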
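import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("Imbalanced_data.csv")  # Load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values
n_splits = 2
n_measures = 2  # Recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # Stratified so every fold keeps the class proportions
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Resample the training fold only. Newer imbalanced-learn releases use
    # sampling_strategy / fit_resample in place of ratio / fit_sample;
    # n_jobs is dropped because it is deprecated on SMOTE in recent releases.
    sm = SMOTE(sampling_strategy="auto", k_neighbors=5)
    smote_enn = SMOTEENN(smote=sm)
    x_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)
    clf_rf.fit(x_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict() takes only the features
    y_score = clf_rf.predict_proba(X_test)[:, 1]  # positive-class probability for AUC
    scores[fold, 0] = recall_score(y_test, y_pred)  # one row per fold, not per sample
    scores[fold, 1] = roc_auc_score(y_test, y_score)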
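This looks like it would fit the bill: http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that returns a balanced data set when fit is called (presumably the one obtained from StratifiedKFold), but calls into SMOTE when predict is called, which is what happens for the test data.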
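You need to look at the pipeline object. imbalanced-learn has a Pipeline that extends the scikit-learn Pipeline so that it handles the samplers' fit_sample() and sample() methods (fit_resample() in newer releases) in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.
Have a look at this example here:
For your code, you would want to do this: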
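from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestClassifier

sm = SMOTE(sampling_strategy="auto", k_neighbors=5)
smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
pipeline = make_pipeline(smote_enn, clf_rf)
# OR, with explicit step names:
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])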
You can then pass this pipeline object to GridSearchCV, RandomizedSearchCV or the other cross-validation tools in scikit-learn as if it were a regular estimator.
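from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# param_dist must be defined beforehand; its keys address pipeline steps as <step>__<parameter>
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)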