Subsample size in scikit-learn RandomForestClassifier
How can I control the size of the sub-sample used to train each tree in the forest?
According to the scikit-learn documentation:
A random forest is a meta estimator that fits a number of decision
tree classifiers on various sub-samples of the dataset and use
averaging to improve the predictive accuracy and control over-fitting.
The sub-sample size is always the same as the original input sample
size but the samples are drawn with replacement if bootstrap=True
(default).
So bootstrap allows randomness, but I can't find how to control the size of the sub-sample.
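Scikit-learn did not provide this option when this was written (releases from 0.22 onward add a max_samples parameter to the forest estimators), but you can easily get it with a (slower) combination of a decision tree and the bagging meta-classifier: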
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
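# The tree is passed positionally because the keyword is base_estimator in older releases and estimator in newer ones.
# max_samples=0.5 fits each tree on a draw of half as many rows as the training set.
clf = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5)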
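As a rough usage sketch of the bagging approach above, the snippet below fits the ensemble on synthetic data and inspects how many rows each tree received. The dataset, the n_estimators and random_state values, and the max_features="sqrt" setting (an assumption meant to mimic a random forest's per-split feature sampling) are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The tree is passed positionally because the keyword name differs
# between scikit-learn releases (base_estimator vs. estimator).
# max_features="sqrt" is an assumption that approximates the feature
# sampling a random forest performs at each split.
clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=25,
    max_samples=0.5,   # each tree is fit on 0.5 * 200 = 100 drawn rows
    bootstrap=True,    # rows are drawn with replacement
    random_state=0,
)
clf.fit(X, y)

# estimators_samples_ holds the row indices drawn for each tree.
sizes = [len(idx) for idx in clf.estimators_samples_]
print(sizes[:3])
```

Because the rows are drawn with replacement, the number of distinct rows each tree sees is smaller than the number of draws.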
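As a side note, Breiman's random forest indeed does not treat the sub-sample size as a parameter and relies entirely on the bootstrap, so a fraction of approximately (1 - 1/e), about 63.2%, of the distinct samples is used to build each tree.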
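The (1 - 1/e) figure can be checked with a small simulation. This is a minimal sketch using only the standard library; the helper name unique_fraction and the row count are hypothetical choices for illustration. Drawing n rows with replacement from N rows hits an expected fraction 1 - (1 - 1/N)^n of the distinct rows, which tends to 1 - 1/e when n = N.

```python
import math
import random


def unique_fraction(n_rows, n_draws, rng):
    """Fraction of distinct rows hit by n_draws draws with replacement."""
    drawn = {rng.randrange(n_rows) for _ in range(n_draws)}
    return len(drawn) / n_rows


rng = random.Random(0)
n_rows = 10_000

# Full bootstrap: as many draws as there are rows.
simulated = unique_fraction(n_rows, n_rows, rng)
expected = 1 - (1 - 1 / n_rows) ** n_rows  # tends to 1 - 1/e as n_rows grows
print(simulated, expected, 1 - 1 / math.e)

# Half-size sub-sample: fewer draws, so fewer distinct rows are seen.
half = unique_fraction(n_rows, n_rows // 2, rng)
print(half, 1 - math.exp(-0.5))
```

The second call illustrates why shrinking the sub-sample reduces the share of the data each tree sees.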
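You can actually modify the _generate_sample_indices function in forest.py to change the sub-sample size each time. Thanks to the fastai library, which implements a function set_rf_samples for this purpose; it looks like this: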
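from sklearn.ensemble import forest  # assumes an older scikit-learn release where this module is public

def set_rf_samples(n):
    """Change scikit-learn's random forests to give each tree a random
    sample of n rows, drawn with replacement.
    """
    # Monkey-patch the private helper that draws the bootstrap indices.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))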
You can add this function to your code.
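For readers on scikit-learn 0.22 or later, the same effect is available without patching private functions, through the max_samples parameter of RandomForestClassifier. The sketch below assumes such a release; the synthetic dataset and the specific parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=20,
    bootstrap=True,     # max_samples only applies when bootstrap=True
    max_samples=0.25,   # a float is read as a fraction of the rows per tree
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```

Passing an int to max_samples instead requests that exact number of rows per tree, which matches the intent of set_rf_samples(n).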