How to do cross-validation on random forest?

Question

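I am using a random forest for binary classification. My dataset is imbalanced, with a 77:23 class ratio. Its shape is (977, 7).

I initially tried the following: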

model = RandomForestClassifier(class_weight='balanced',max_depth=5,max_features='sqrt',n_estimators=300,random_state=24)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

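However, I now want to apply cross-validation while training the random forest, and then use that model to predict the y values of the test data. So I did the following: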

model = RandomForestClassifier(class_weight='balanced',max_depth=5,max_features='sqrt',n_estimators=300,random_state=24)
scores = cross_val_score(model,X_train, y_train,cv=10, scoring='f1')
y_pred = cross_val_predict(model,X_test,cv=10)

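As you can see, this is not correct. How can I apply cross-validation while training the random forest, and then use that cross-validated model to predict y_pred correctly?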

Answer 1

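The purpose of cross-validation is model checking, not model building.

Once cross-validation has confirmed that you get similar metrics on every split, you should train your model on all of your training data.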
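As a minimal sketch of that two-step workflow: check the metric across folds first, then fit once on the full training set and predict the test set. The data here is a synthetic stand-in built with `make_classification` (an assumption, since the asker's data is not shown); the hyperparameters are the ones from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the asker's data (assumption): 977 rows,
# 6 features, roughly 77:23 class balance.
X, y = make_classification(n_samples=977, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.77, 0.23],
                           random_state=24)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=24)

model = RandomForestClassifier(class_weight='balanced', max_depth=5,
                               max_features='sqrt', n_estimators=300,
                               random_state=24)

# Step 1: model checking. cross_val_score fits clones of `model` on each
# fold, so `model` itself is still unfitted after this call.
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='f1')
print("CV F1 per fold:", scores)
print("mean / std:", scores.mean(), scores.std())

# Step 2: model building. Fit once on all of the training data,
# then predict the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
print("test F1:", test_f1)
```

With an integer `cv` and a classifier, scikit-learn uses stratified folds by default, which suits an imbalanced target like this one.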

Answer 2

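You cannot get the fitted model back after cross-validation using 'cross_val_score' or 'cross_val_predict'. Instead, you can use the code block below, which computes the F1 score for each fold on both the validation data and the test data.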

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

k = 10
kf_10 = KFold(n_splits=k, shuffle=True, random_state=24)  # random_state only takes effect with shuffle=True
model_rfc = RandomForestClassifier(class_weight='balanced',max_depth=5,max_features='sqrt',n_estimators=300,random_state=24)
rfc_f1_CV_list = []
rfc_f1_test_list = []

for train_index, test_index in kf_10.split(X_train):
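    # Index into the training split (assumes NumPy arrays; use .iloc[...] for pandas objects).
    X_train_CV, X_test_CV = X_train[train_index], X_train[test_index]
    y_train_CV, y_test_CV = y_train[train_index], y_train[test_index]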
    model_rfc.fit(X_train_CV, y_train_CV)

    # Target prediction & F1 score on the fold held out from this CV iteration (about 1/10 of the training rows).
    y_pred_CV = model_rfc.predict(X_test_CV)
    rfc_f1_CV = f1_score(y_test_CV, y_pred_CV)
    rfc_f1_CV_list.append(rfc_f1_CV)

    #Target prediction & F1 score using the rows from your test split.
    y_pred_test = model_rfc.predict(X_test)
    rfc_f1_test = f1_score(y_test, y_pred_test)
    rfc_f1_test_list.append(rfc_f1_test)

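You can modify the code above to save the model at a given fold and use it in the snippet below:

# cross_val_predict has no `scoring` argument; it also refits clones of `model` on folds of the data passed in.
y_pred = cross_val_predict(model, X_test, y_test, cv=10)
f1_score(y_test, y_pred, average='binary')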
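If the goal is to keep the fitted model from each fold, one option (not part of the original answer) is `cross_validate` with `return_estimator=True`, which hands back the per-fold estimators alongside the scores. The sketch below again uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `fold_models` and `test_f1_per_fold` are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import (StratifiedKFold, cross_validate,
                                     train_test_split)

# Synthetic stand-in for the asker's data (assumption).
X, y = make_classification(n_samples=977, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.77, 0.23],
                           random_state=24)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=24)

model = RandomForestClassifier(class_weight='balanced', max_depth=5,
                               max_features='sqrt', n_estimators=300,
                               random_state=24)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=24)

# return_estimator=True keeps the model fitted on each fold's training part.
results = cross_validate(model, X_train, y_train, cv=cv, scoring='f1',
                         return_estimator=True)

fold_f1 = results['test_score']    # F1 on each held-out validation fold
fold_models = results['estimator'] # one fitted forest per fold

# Score every per-fold model on the separate test split.
test_f1_per_fold = [f1_score(y_test, m.predict(X_test)) for m in fold_models]
print("validation F1 per fold:", fold_f1)
print("test F1 per fold:", test_f1_per_fold)
```

Each of these models has seen only nine tenths of the training data, so for final predictions a single model refit on the whole training set is usually the simpler choice.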

