scikit-learn Random Forest excessive memory usage

Question

I am running the scikit-learn (version 0.15.2) Random Forest with Python 3.4 on Windows 7 64-bit. I have this very simple model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the training set: column 0 holds the class label, the remaining columns are features.
Data = np.genfromtxt('C:/Data/Tests/Train.txt', delimiter=',')

print("nrows = ", Data.shape[0], "ncols = ", Data.shape[1])
X = np.float32(Data[:, 1:])  # feature matrix
Y = np.int16(Data[:, 0])     # class labels
RF = RandomForestClassifier(n_estimators=1000)
RF.fit(X, Y)

The X dataset contains about 30,000 x 500 elements, in the following format:

139.2398242257808,310.7242684642465,...

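Even without parallel processing, memory usage eventually climbs to 16 GB! I would like to know why it takes so much memory.

I know this has been asked before, but that was prior to version 0.15.2...

Any suggestions?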

Answer 1

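Try reducing the number of trees by setting a smaller n_estimators parameter. You can then control the depth of the trees with max_depth or min_samples_split, trading depth for a larger number of estimators.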
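As a rough sketch of that trade-off, the snippet below compares the total node count of a small unconstrained forest with a larger forest whose trees are capped by max_depth and min_samples_split. The synthetic data, the helper total_nodes, and the specific parameter values are illustrative assumptions, not taken from the question; node count is used here only as a proxy for memory.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; far smaller than the 30,000 x 500 set above).
rng = np.random.RandomState(0)
X = rng.rand(2000, 20).astype(np.float32)
Y = rng.randint(0, 10, size=2000)


def total_nodes(forest):
    """Sum the node counts of all fitted trees; a rough proxy for memory use."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Fully grown trees: each tree keeps splitting until its leaves are pure.
unconstrained = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, Y)

# Twice as many trees, but each one is shallow and stops splitting early.
constrained = RandomForestClassifier(
    n_estimators=40, max_depth=6, min_samples_split=20, random_state=0
).fit(X, Y)

print("unconstrained nodes:", total_nodes(unconstrained))
print("constrained nodes:  ", total_nodes(constrained))
```

A tree limited to depth 6 can hold at most 127 nodes, so the constrained forest stays small even with more estimators. Whether accuracy holds up under these limits has to be checked on the real data.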

Answer 2

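Unfortunately, memory consumption is linear in the number of classes. Since you have 100 of them and a fair number of samples, it is no surprise that memory blows up. Solutions include controlling the size of the trees (max_depth, min_samples_leaf, ...), their number (n_estimators), or reducing the number of classes in your problem, if possible.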
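The linear dependence on the number of classes can be checked on a fitted forest. This is a minimal sketch, assuming that each fitted tree exposes tree_.value as an array with one slot per class for every node, as it does in recent scikit-learn releases; whether 0.15.2 lays it out identically is an assumption. The function value_bytes_per_node and the synthetic data are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def value_bytes_per_node(n_classes, n_samples=2000, n_features=10):
    """Average size in bytes of the per-node class-value array in a small forest."""
    rng = np.random.RandomState(0)
    X = rng.rand(n_samples, n_features).astype(np.float32)
    # Cycle through the labels so every class is guaranteed to appear.
    Y = np.arange(n_samples) % n_classes
    forest = RandomForestClassifier(
        n_estimators=5, max_depth=5, random_state=0
    ).fit(X, Y)
    nodes = sum(est.tree_.node_count for est in forest.estimators_)
    value_bytes = sum(est.tree_.value.nbytes for est in forest.estimators_)
    return value_bytes / nodes


for k in (2, 10, 100):
    print(k, "classes:", value_bytes_per_node(k), "bytes per node")
```

The per-node cost grows in proportion to the class count, and it is multiplied by the number of nodes per tree and by n_estimators, which is why deep trees, 1000 estimators, and 100 classes compound.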

python

random-forest

scikit-learn