Python RandomForest sk-learn: stuck for a few hours, what is going on?

I have a large input matrix. Right now the program has been sitting at calibrated_clf.fit(x_train, y_train) for a few hours, and I don't know whether it is dead or still working. How can I print some kind of progress while the calibrated_clf.fit(x_train, y_train) call is running?

clf = ensemble.RandomForestClassifier(criterion='entropy', n_estimators=350,
                                      max_features=200, n_jobs=-1)
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic')
print("Here 1")
calibrated_clf.fit(x_train, y_train)
print("Here 2")

x_train is an array of shape (51733, 250). I have been stuck at the "Here 1" printout for a few hours.

Obviously this could be done by inserting a printout into the CalibratedClassifierCV source code, which ships as part of sklearn, but that would require being quite familiar with the algorithm and its implementation.

Since you don't need to know the exact progress of the fit, a work-around is to subclass ndarray and overload the indexing operator (I assume the x_train and y_train you pass in are ndarrays). That way, every time CalibratedClassifierCV's fit method iterates and tries to access the data, it will call your custom code. For example:

import numpy as np

class array_plus(np.ndarray):
    def __getitem__(self, idx):
        print("array_plus indexing operator called")
        return np.ndarray.__getitem__(self, idx)

Before passing the data to the fit method, you can "convert" them (formally, Python does not support "casting") into your new class:

new_x_train = x_train.view(array_plus)  # use .view(), not array_plus(x_train):
new_y_train = y_train.view(array_plus)  # the ndarray constructor expects a shape

calibrated_clf.fit(new_x_train, new_y_train)

You could even put a counter in the subclass to get a rough idea of where you are.
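The counter idea can be sketched as follows; CountingArray is a hypothetical name, and the class-level counter is an assumption about how you might want to aggregate the accesses:

```python
import numpy as np

class CountingArray(np.ndarray):
    """Hypothetical ndarray subclass that tallies every index access."""
    access_count = 0  # class-level, so all views of this class share one tally

    def __getitem__(self, idx):
        CountingArray.access_count += 1
        return super().__getitem__(idx)

# ndarray subclasses are created from existing arrays with .view()
a = np.arange(10).view(CountingArray)
a[0]
a[2:5]
print(CountingArray.access_count)  # 2
```

Printing (or logging) the counter every few thousand accesses would give the rough heartbeat the question asks for, without touching sklearn internals.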

You can simply set verbose to a value higher than 0.

From

import joblib  # sklearn.externals.joblib was removed in recent scikit-learn versions
help(joblib.parallel)

verbose: int, optional
    The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it more than 10, all iterations are reported.

RandomForestClassifier uses the Parallel helper from the joblib library.

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

n = 1000

X, y = make_blobs(n_samples=n)
X_train, y_train = X[0:n // 2], y[0:n // 2]
X_valid, y_valid = X[n // 2:], y[n // 2:]

clf = RandomForestClassifier(n_estimators=350, verbose=100)
clf.fit(X_train, y_train)

Output

building tree 1 of 350
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
building tree 2 of 350
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
building tree 3 of 350
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
building tree 4 of 350
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
building tree 5 of 350

[...]

building tree 100 of 350
building tree 101 of 350
building tree 102 of 350
building tree 103 of 350
building tree 104 of 350
[...]
building tree 350 of 350
[Parallel(n_jobs=1)]: Done 350 out of 350 | elapsed:    1.6s finished
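The same flag also helps in the question's original setup: pass verbose to the inner RandomForestClassifier, and joblib's per-tree progress messages will appear while CalibratedClassifierCV.fit is running each cross-validation fold. A minimal sketch on toy data (make_blobs and the small sizes are assumptions for the demo, not the question's real data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# verbose on the inner estimator -> progress printed during every CV fold's fit
clf = RandomForestClassifier(n_estimators=10, verbose=1, random_state=0)
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic')
calibrated_clf.fit(X, y)

proba = calibrated_clf.predict_proba(X)
print(proba.shape)  # one calibrated probability column per class
```

If the fit truly hangs, the point at which these messages stop appearing narrows down which tree (and hence roughly how far along the fit) it got stuck.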

If the problem is the number of trees you are using, here is a small trick to work around it:

You can set the parameter warm_start to True. Do it as follows:

# Start with 10 estimators; warm_start=True keeps the trees already built
growing_rf = RandomForestClassifier(n_estimators=10, n_jobs=-1,
                                    warm_start=True, random_state=42)
growing_rf.fit(X_train, y_train)
for i in range(34):  # add 340 more trees in batches of 10, for 350 in total
    growing_rf.n_estimators += 10
    growing_rf.fit(X_train, y_train)

In the end, you can predict your test data with a random forest containing 350 trees:

growing_rf.predict_proba(X_test)
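Because each fit call in the warm_start loop returns after only a small batch of trees, you can print progress between batches, which directly answers the original question. A sketch on toy data (make_blobs, the batch size of 10, and the demo total of 50 trees are assumptions; use 350 for the real run):

```python
import time

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_blobs(n_samples=200, centers=2, random_state=0)

# warm_start=True keeps already-built trees when n_estimators grows
growing_rf = RandomForestClassifier(n_estimators=10, warm_start=True,
                                    random_state=42)
start = time.time()
for total in range(10, 51, 10):  # grow to 50 trees in batches of 10
    growing_rf.n_estimators = total
    growing_rf.fit(X_train, y_train)  # fits only the 10 new trees
    print("%3d trees fitted, %.1fs elapsed" % (total, time.time() - start))

print(len(growing_rf.estimators_))  # 50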