Random forest: balancing test set?

Question

I am trying to run a random forest classifier on an imbalanced dataset (~1:4).

I am using the method from imblearn as follows:

from imblearn.ensemble import BalancedRandomForestClassifier

rf = BalancedRandomForestClassifier(n_estimators=1000, random_state=42,
                                    class_weight='balanced', sampling_strategy='not minority')
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)

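The train/test splits are produced inside a cross-validation loop using RepeatedStratifiedKFold from scikit-learn.

However, I am wondering whether the test set also needs to be balanced in order to obtain sensible performance scores (sensitivity, specificity, etc.). I hope you can help me.

Thank you very much!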

Answer 1

From the imblearn docs:

A balanced random forest randomly under-samples each bootstrap sample to balance it.

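If you are happy with random under-sampling as your balancing method, then the classifier does this for you "under the hood". In fact, that is the whole point of using imblearn to handle class imbalance in the first place. If you were using a plain random forest, such as the out-of-the-box version from sklearn, then I would be more concerned about dealing with the class imbalance up front.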
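As for the test set itself, here is a small sketch of why balancing it is usually unnecessary for sensitivity and specificity. Both are computed within a single true class, so they do not depend on the class ratio. The confusion counts below are made up for illustration, and the "balanced" case assumes under-sampling leaves the per-class error rates unchanged, which only holds in expectation for a random subsample.

```python
def rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, precision, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, precision, accuracy


# Hypothetical test fold at the natural ~1:4 ratio: 100 positives, 400 negatives.
imbalanced = rates(tp=80, fn=20, fp=40, tn=360)

# Same classifier, negatives under-sampled to 100 (a "balanced" test set).
# Assumption: the per-class error rates stay the same, so the negative counts shrink 4x.
balanced = rates(tp=80, fn=20, fp=10, tn=90)

for name, r in (("imbalanced", imbalanced), ("balanced", balanced)):
    print(name, [round(x, 3) for x in r])
```

Sensitivity and specificity come out the same in both cases, while precision and accuracy shift with the class ratio. That suggests leaving the test folds at their natural distribution, which RepeatedStratifiedKFold already preserves, so the reported scores reflect the data the model will actually see.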

python

random-forest

imblearn