Implementing R's random forest feature importance score in scikit-learn

I am trying to implement R's feature importance scoring method for a random forest regression model in sklearn. According to R's documentation:

The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

So, if I understand correctly, I need to be able to permute each predictor variable (feature) over the OOB samples of each tree.
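The aggregation step in the quoted description can be sketched as follows, assuming the per-tree error differences have already been collected into a hypothetical array `diffs` of shape (n_trees, n_features). The function name `normalize_importance` and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration; the quoted text does not say which estimator R uses.

```python
import numpy as np

def normalize_importance(diffs):
    """Average per-tree error differences and scale by their standard deviation.

    diffs[t, j] is the OOB error of tree t after permuting feature j
    minus the OOB error of tree t on the unpermuted data.
    """
    diffs = np.asarray(diffs, dtype=float)
    mean = diffs.mean(axis=0)
    sd = diffs.std(axis=0, ddof=1)  # assumption: sample standard deviation
    scaled = mean.copy()
    nonzero = sd > 0
    # Skip the division where the standard deviation is 0, as the quote describes.
    scaled[nonzero] = mean[nonzero] / sd[nonzero]
    return mean, scaled

# Toy input: 3 trees, 2 features; the second feature never changes the error.
diffs = np.array([[0.5, 0.0],
                  [1.5, 0.0],
                  [1.0, 0.0]])
mean, scaled = normalize_importance(diffs)
```

For the toy input, the first feature has a mean difference of 1.0 and a standard deviation of 0.5, so its scaled score is 2.0; the second feature is left at 0 without dividing.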

I know I can access each tree in a trained forest with something like this:

from sklearn.ensemble import RandomForestRegressor

numberTrees = 100
clf = RandomForestRegressor(n_estimators=numberTrees)
clf.fit(X, Y)
for tree in clf.estimators_:
    pass  # do something with each fitted DecisionTreeRegressor

Is there a way to get the list of OOB samples for each tree? Perhaps I could use each tree's random_state to derive the list of OOB samples?
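One way to sidestep the question of recovering OOB rows from a fitted forest is to draw the bootstrap samples by hand, so the OOB rows of every tree are known by construction. The sketch below does that with plain `DecisionTreeRegressor` objects; it is not how `RandomForestRegressor` exposes anything, and the function name `oob_permutation_importance`, its parameters, and the synthetic data are all assumptions for illustration. With `max_features=None` this is plain bagging; pass a smaller `max_features` to get per-split feature subsampling.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_permutation_importance(X, y, n_trees=50, max_features=None, seed=0):
    """Bag trees by hand so each tree's OOB rows are known, then permute features.

    Returns an array of shape (n_trees, n_features) holding, per tree,
    the increase in OOB MSE after permuting each feature.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    rng = np.random.RandomState(seed)
    diffs = np.zeros((n_trees, n_features))

    for t in range(n_trees):
        # Bootstrap sample: rows drawn with replacement are "in bag".
        in_bag = rng.randint(0, n_samples, n_samples)
        # Rows that were never drawn are out-of-bag for this tree.
        oob = np.setdiff1d(np.arange(n_samples), in_bag)
        if oob.size == 0:
            continue  # no OOB rows; this tree's row of diffs stays at zero
        tree = DecisionTreeRegressor(max_features=max_features,
                                     random_state=rng.randint(2**31 - 1))
        tree.fit(X[in_bag], y[in_bag])

        X_oob, y_oob = X[oob], y[oob]
        base_mse = np.mean((tree.predict(X_oob) - y_oob) ** 2)
        for j in range(n_features):
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            perm_mse = np.mean((tree.predict(X_perm) - y_oob) ** 2)
            diffs[t, j] = perm_mse - base_mse
    return diffs

# Synthetic check: only the first of three features drives the target.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

diffs = oob_permutation_importance(X_demo, y_demo, n_trees=30)
raw_importance = diffs.mean(axis=0)  # average MSE increase over trees
```

Averaging `diffs` over trees gives the unscaled score; dividing each column's mean by that column's standard deviation gives the scaled version described in the R documentation quoted above.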

Although R uses OOB samples, I found that I get similar results in scikit-learn by using all of the training samples. I am doing the following:

from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# X_train (a DataFrame), y_train, trees, num_features and leaf are assumed to be defined already

# permute training data and score against its own model
epoch = 3
seeds = range(epoch)

scores = defaultdict(list)  # {feature: relative change in R^2}

# repeat the process several times, then average the score for each feature
for j in range(epoch):
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[j],
                                max_features=num_features, min_samples_leaf=leaf)

    clf = clf.fit(X_train, y_train)
    acc = clf.score(X_train, y_train)  # R^2 on the unpermuted training data

    print('Epoch', j)
    # for each feature, permute its values and check the resulting score
    for i, col in enumerate(X_train.columns):
        if i % 200 == 0:
            print("- feature %s of %s permuted" % (i, X_train.shape[1]))
        X_train_copy = X_train.copy()
        X_train_copy[col] = np.random.permutation(X_train[col])
        shuff_acc = clf.score(X_train_copy, y_train)
        scores[col].append((acc - shuff_acc) / acc)

# get mean across epochs
scores_mean = {k: np.mean(v) for k, v in scores.items()}

# sort scores (best first)
scores_sorted = pd.DataFrame.from_dict(scores_mean, orient='index').sort_values(0, ascending=False)
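For comparison, scikit-learn 0.22 and later ship `sklearn.inspection.permutation_importance`, which performs the same permute-and-rescore loop for a fitted estimator. The sketch below applies it to a held-out split instead of the training data; the synthetic data and variable names are assumptions for illustration. Note that it permutes whole-model predictions rather than per-tree OOB predictions, and it does not apply the standard-deviation scaling from the R documentation, so its numbers are not expected to match R's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: features 0 and 1 matter, features 2 and 3 are noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Each feature is shuffled n_repeats times; the score drop (R^2 by default
# for regressors) is recorded for every repeat.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
```

`result.importances` holds the raw per-repeat drops with shape (n_features, n_repeats), and `result.importances_std` gives their spread if a scaled score is wanted.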