AUC-based Feature Importance using Random Forest

Question

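I am trying to predict a binary variable with both random forests and logistic regression. My classes are heavily unbalanced (approximately 1.5% of Y=1).

The default feature importance techniques in random forests are based on classification accuracy (error rate), which has been shown to be a poor measure for unbalanced classes (see here and here).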
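To make the imbalance problem concrete, here is a minimal sketch on synthetic labels (not the data from the question). The 2000-sample size and the always-negative "classifier" are assumptions chosen purely for illustration: with 1.5% positives, predicting the majority class everywhere looks excellent on accuracy while AUC shows it has no ranking ability.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with 1.5% positives (30 of 2000); purely illustrative.
y_true = np.zeros(2000, dtype=int)
y_true[:30] = 1

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)
# Its scores are constant, so it cannot rank positives above negatives.
y_score = np.zeros(len(y_true), dtype=float)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
# prints: accuracy = 0.985, AUC = 0.500
```

An importance measure built on accuracy inherits this blind spot, which is the motivation for an AUC-based alternative.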

The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
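For reference, scikit-learn exposes an impurity-based importance of this kind through the `feature_importances_` attribute of a fitted forest. The sketch below uses a synthetic dataset from `make_classification` (an assumption, not the data from the question). Note that scikit-learn reports a normalized variant: per-tree importances are averaged over the forest and rescaled to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced toy data (about 5% positives); illustrative only.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           weights=[0.95, 0.05], random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based (Gini) importance as reported by scikit-learn.
gini_vim = rf.feature_importances_

# The same quantity rebuilt by hand: average the per-tree importances
# over the forest, then rescale so the values sum to 1.
per_tree_mean = np.mean([tree.feature_importances_ for tree in rf.estimators_],
                        axis=0)
rebuilt = per_tree_mean / per_tree_mean.sum()

for i in np.argsort(gini_vim)[::-1]:
    print(f"feature {i}: {gini_vim[i]:.4f}")
```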

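My question is: is this kind of method (an AUC-based importance) implemented in scikit-learn, as it is in the R package party? Or is there perhaps a workaround?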

PS: This question is somewhat related to another one.

Answer 1

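scoring is only a performance evaluation tool applied to the test sample; it does not enter the internal DecisionTreeClassifier algorithm at each split node. For tree algorithms, you can only set criterion (a kind of internal loss function used at each split node) to gini or information entropy.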

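scoring can be used in a cross-validation context, where the goal is to tune some hyperparameters (such as max_depth). In your case, you can use GridSearchCV with the scoring function roc_auc to tune some of your hyperparameters.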
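A minimal sketch of that suggestion, assuming synthetic data and an arbitrary small grid (neither comes from the question): `criterion` is the only thing that reaches the split-level objective, while `scoring="roc_auc"` is used solely to compare the cross-validated candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced toy data (about 10% positives); illustrative only.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical grid: criterion changes how nodes are split,
# max_depth limits tree size.
param_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced",
                           random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # used only to rank candidates, not to grow trees
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds are used here so that each fold keeps some positives, which matters when the minority class is rare.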

Answer 2

After doing some research, this is what I came up with:

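from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Feature names: every column of db_train except the first
names = db_train.iloc[:, 1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                 class_weight="balanced",  # was "auto" in older scikit-learn releases
                                 criterion='gini',
                                 bootstrap=True,
                                 max_features=10,
                                 min_samples_split=2,  # values below 2 are rejected by current scikit-learn
                                 min_samples_leaf=6,
                                 max_depth=3,
                                 n_jobs=-1)
scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)

# Baseline AUC on the untouched test set.
# Note: this scores hard class predictions; rf.predict_proba(X_test)[:, 1]
# would give the usual ranking-based AUC instead.
acc = roc_auc_score(Y_test, rf.predict(X_test))

# Permute one feature at a time and record the relative drop in AUC.
# X_test is assumed to be a NumPy array (columns are accessed as X_t[:, i]).
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])  # shuffles column i in place
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))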

Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]

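The output is not very pretty, but you get the idea. The weakness of this approach is that feature importance seems to be very parameter dependent. I ran it with different parameters (max_depth, max_features, ...) and got quite different results. So I decided to run a grid search on the parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.

I took my inspiration from this (great) notebook.

All suggestions/comments are most welcome!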
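For readers on a recent scikit-learn (0.22 or later), the grid-search-then-permute workflow described in this answer can also be sketched with the built-in `sklearn.inspection.permutation_importance`, which accepts `scoring="roc_auc"`. This is a sketch under assumptions, not the code used above: the data are synthetic, the grid is arbitrary, and this function reports the absolute drop in AUC averaged over repeats rather than the relative drop computed in the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced toy data (about 5% positives); illustrative only.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Step 1: tune the forest with AUC as the model-selection score.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced",
                           random_state=0),
    param_grid={"max_depth": [3, 6], "max_features": [2, 4]},  # hypothetical grid
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)

# Step 2: AUC-based permutation importance of the best model on held-out data.
result = permutation_importance(search.best_estimator_, X_te, y_te,
                                scoring="roc_auc", n_repeats=10,
                                random_state=0)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

Repeating the permutation several times (`n_repeats`) gives a spread for each feature, which helps judge whether differences between nearby features are more than noise.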

Tags: python, scoring, machine-learning, scikit-learn